Introduction


Who is this book for? 


This book is primarily targeted to programmers or learners who want
to learn R programming for statistics. This book will cover using R
programming for descriptive statistics, inferential statistics, regression
analysis, and data visualizations.


How is this book structured? 


The structure of the book is determined by the following two requirements:


• This book is useful for beginners to learn R programming for
statistics.


• This book is useful for experts who want to use this
book as a reference.


Topic                                                      Chapters
Introduction to R and R programming fundamentals           1 to 3
Descriptive statistics, data visualizations,               4 to 6
inferential statistics, and regression analysis


Contacting the Author 


More information about Eric Goh can be found at www.svbook.com. He can
be reached at gohminghui88@svbook.com.


CHAPTER 1 


Introduction 


In this book, you will use R for applied statistics, which can be used in the
data understanding and modeling stages of the CRISP-DM (data mining)
model. Data mining is the process of mining insights and knowledge
from data. R programming was created for statistics and is used in
academic and research fields. R programming has evolved over time and
many packages have been created to do data mining, text mining, and data
visualization tasks. R is very mature in the statistics field, so it is ideal to
use R for the data exploration, data understanding, or modeling stages of
the CRISP-DM model.


What Is R? 


According to Wikipedia, R programming is for statistical computing 
and is supported by the R Foundation for Statistical Computing. The R 
programming language is used by academics and researchers for data 
analysis and statistical analysis, and R programming’s popularity has risen 
over time. As of June 2018, R is ranked 10th in the TIOBE index. The TIOBE 
Company created and maintains the TIOBE programming community 
index, which is the measure of the popularity of programming languages. 
TIOBE is the acronym for "The Importance of Being Earnest." 

R is a GNU package and is available freely under the GNU General
Public License. This means that R is available with source code, and you
are free to use R, but you must adhere to the license. R is available as a
command line application, but there are many integrated development
environments (IDEs) available for R. An IDE is software that has
comprehensive facilities like a code editor, compiler, and debugger tools to
help developers write R scripts. One famous IDE is RStudio, which assists
developers in writing R scripts by providing all the required tools in one
software package.

R is an implementation of the S programming language and was created
by Ross Ihaka and Robert Gentleman at the University of Auckland. R and
its libraries are made up of statistical and graphical techniques, including
descriptive statistics, inferential statistics, and regression analysis.
Another strength of R is that it is able to produce publishable quality
graphs and charts, and can use packages like ggplot2 for advanced graphs.

According to the CRISP-DM model, to do a data mining project, you
must understand the business, and then understand and prepare the 
data. Then comes modeling and evaluation, and then deployment. R is 
strong in statistics and data visualization, so it is ideal to use R for data 
understanding and modeling. 

Along with Python, R is used widely in the field of data science,
which consists of statistics, machine learning, and domain expertise or
knowledge.


High-Level and Low-Level Languages 


A high-level programming language (HLL) is designed to be used by a
human and is closer to human language. Its programming style is
easier to comprehend and implement than that of a low-level programming
language (LLL). A high-level programming language needs to be converted
to machine language before being executed, so a high-level programming
language can be slower.

A low-level programming language, on the other hand, is a lot closer to
machine language. A low-level programming language can be executed on a
computer with little or no conversion between languages before execution.
Thus, a low-level programming language can be faster than a high-level
programming language. Low-level programming languages like assembly
language are more inclined toward machine language, which deals with the
bits 0 and 1.

R is an HLL because it shares many similarities with human languages. For
example, in R programming code,


> var1 <- 1;
> var2 <- 2;
>
> result <- var1 + var2;
> print(result)
[1] 3
>


The R programming code is more like human language. A low-level
programming language like assembly language is closer to machine
language, such as 0011 0110:


0x52ac87: movl 7303445(%ebx), %eax
0x52ac78: calll 0x6bfb03


What Is Statistics? 


Statistics is a collection of mathematical methods that deal with the
organization, analysis, and interpretation of data. Three main statistical
methods are used in data analysis: descriptive statistics, inferential
statistics, and regression analysis.

Descriptive statistics summarizes the data and usually focuses on
the distribution, the central tendency, and the dispersion of the data. The
distribution can be, for example, a normal distribution or a binomial
distribution. The central tendency describes the data with respect to its
center and can be measured by the mean, median, or mode. The dispersion
describes the spread of the data and can be measured by the variance,
standard deviation, or interquartile range.
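As a minimal sketch of these measures in R, the following computes the central tendency and dispersion of a small sample. The scores vector is made-up illustrative data, and stat_mode is a hypothetical helper written for this example, because base R has no built-in function for the statistical mode.

```r
# Hypothetical sample of exam scores (illustrative data only)
scores <- c(55, 60, 60, 72, 75, 81, 90)

# Central tendency
score_mean   <- mean(scores)
score_median <- median(scores)

# Hypothetical helper: returns the most frequent value
# (the first one found if several values tie)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
score_mode <- stat_mode(scores)

# Dispersion
score_var <- var(scores)  # sample variance
score_sd  <- sd(scores)   # sample standard deviation
score_iqr <- IQR(scores)  # interquartile range

print(c(mean = score_mean, median = score_median, mode = score_mode))
print(c(variance = score_var, sd = score_sd, iqr = score_iqr))
```

The summary() function gives a quicker overview of the same sample when you do not need each value separately.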

Inferential statistics tests the relationship between two data sets or
two samples, and a hypothesis is usually set for the statistical relationship
between them. The hypothesis can be a null hypothesis or an alternative
hypothesis, and the decision to reject the null hypothesis is made using
tests like the T Test, Chi Square Test, and ANOVA. The Chi Square Test is
more for categorical variables, and the T Test is more for continuous
variables. The ANOVA test compares the means of more than two groups.
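The following sketch shows how the T Test and Chi Square Test might be run in R with the built-in t.test() and chisq.test() functions. The two samples are simulated and the table of counts is invented for illustration. The 0.05 threshold is a common convention, not a rule from this book.

```r
set.seed(42)  # make the simulated samples reproducible

# Two hypothetical samples of a continuous variable (simulated data)
group_a <- rnorm(30, mean = 50, sd = 5)
group_b <- rnorm(30, mean = 55, sd = 5)

# T Test: the null hypothesis is that both groups have the same mean
t_result <- t.test(group_a, group_b)
print(t_result$p.value)

# Hypothetical counts for two categorical variables
counts <- matrix(c(30, 20, 15, 35), nrow = 2,
                 dimnames = list(gender = c("F", "M"),
                                 preference = c("A", "B")))

# Chi Square Test: the null hypothesis is that the variables are independent
chi_result <- chisq.test(counts)
print(chi_result$p.value)

# Reject the null hypothesis when the p-value is below the chosen threshold
print(t_result$p.value < 0.05)
print(chi_result$p.value < 0.05)
```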

Regression analysis is used to identify the relationships between
variables. A regression can be linear or non-linear. A linear regression
can be a simple linear regression, with one input variable, or a multiple
linear regression, which identifies relationships among more variables.

Data visualization is the technique used to communicate or present
data using graphs, charts, and dashboards. Data visualizations can help us
understand the data more easily.


What Is Data Science? 


Data science is a multidisciplinary field that includes statistics, computer
science, machine learning, and domain expertise to get knowledge
and insights from data. Data science usually ends up developing a data
product. A data product turns the data of a company into a product that
solves a problem.

For example, a data product can be the product recommendation 
system used in Amazon and Lazada. These companies have a lot of data 
based on shoppers’ purchases. Using this data, Amazon and Lazada can 
identify the shopping patterns of shoppers and create a recommendation 
system or data product to recommend other products whenever a shopper 
buys a product. 

The term "data science" has become a buzzword and is now used to
represent many areas like data analytics, data mining, text mining, data 
visualizations, prediction modeling, and so on. 

The history of data science started in November 1997, when C. F.
Jeff Wu characterized statistical work as data collection, analysis, and
decision making, and presented his lecture called "Statistics = Data
Science?" In 2001, William S. Cleveland introduced data science as a field
that comprised statistics and some computing in his article called "Data
Science: An Action Plan for Expanding the Technical Areas of the Field of
Statistics."

DJ Patil, who claims to have coined the term "data science" with Jeff
Hammerbacher and who co-wrote the "Data Scientist: The Sexiest Job of the
21st Century" article published in the Harvard Business Review, says that
there is a data scientist shortage in many industries. Data science is
important in many companies because data analysis can help them make
many decisions, and every company needs to make decisions about its
strategic direction.

Statistics is important in data science because it can help analysts or 
data scientists analyze and understand data. Descriptive statistics assists in 
summarizing the data, inferential statistics tests the relationship between 
two data sets or samples, and regression analysis explores the relationships 
between multiple variables. Data visualizations can explore the data 
with charts, graphs, and dashboards. Regressions and machine learning 
algorithms can be used in predictive analytics to train a model and predict 
a variable. 

Linear regression has the formula y = mx + c. You use historical data
to train the model and estimate m and c. Here, y is the output variable and
x is the input variable. Machine learning algorithms and regression or
statistical learning algorithms predict a variable in a similar way.
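A minimal sketch of estimating m and c in R uses the built-in lm() function. The history data frame holds invented values for illustration, and the names m_hat and c_hat are assumptions chosen for this example.

```r
# Hypothetical historical data (illustrative values only)
history <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(5.1, 6.9, 9.2, 10.8, 13.1, 15.0)
)

# Fit y = m * x + c by ordinary least squares
model <- lm(y ~ x, data = history)

# The intercept is c and the coefficient of x is m
c_hat <- unname(coef(model)["(Intercept)"])
m_hat <- unname(coef(model)["x"])
print(c(m = m_hat, c = c_hat))
```

Calling summary(model) prints the same coefficients together with their standard errors and the R-squared value.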

Domain expertise is knowledge of the field that the data set comes from.
If the data set is business data, then the domain expertise should be
business; if it is university data, education is the domain expertise; if the
data set is healthcare data, healthcare is the domain expertise. I believe
that business is the most important knowledge because almost all companies
use data analysis to make important strategic business decisions.

Adding in product design and engineering knowledge takes us into the
fields of Internet of Things (IoT) and smart cities because data science and
predictive analytics can be used on sensor data. Because data science is
a multidisciplinary field, if you can master statistics, machine learning,
and business knowledge, you will be extremely hard to replace. You can also
work with statisticians, machine learning engineers, or business experts to
complete a data science project.

Figure 1-1 shows a data science diagram. 


Source: Palmer, Shelly. Data Science for the C-Suite. 
New York: Digital Living Press, 2015. Print. 





Figure 1-1. Data science is an intersection 


What Is Data Mining? 


Data mining is closely related to data science. Data mining is the process
of identifying the patterns from data using statistics, machine learning, and
data warehouses or databases.

Extraction of patterns from data is not very new, and early methods
include the use of Bayes' theorem and regressions. The growth of
technologies increases the ability to collect data. The growth of
technologies also allows the use of statistical learning and machine
learning algorithms like neural networks, fuzzy logic, decision trees,
genetic algorithms, and support vector machines to uncover the hidden
patterns of data. Data mining combines statistics and machine learning,
and usually results in the creation of models for making predictions based
on historical data.

The cross-industry standard process for data mining, also known as
CRISP-DM, is a process used by data mining experts and is one of the
most popular data mining models. See Figure 1-2.


[Diagram: Business Understanding, Data Understanding, Data Preparation,
Modeling, Evaluation, Deployment]





Figure 1-2. Cross-industry standard process for data mining

The CRISP-DM model was created in 1996 and involved SPSS,
Teradata, Daimler AG, NCR Corporation, and OHRA. The first version
was presented at the fourth CRISP-DM SIG Workshop in Brussels in 1999.
Many practitioners use the CRISP-DM model, but IBM is the company that
focuses on the CRISP-DM model and includes it in SPSS Modeler.
However, the CRISP-DM model is actually application neutral. The
following sections explain its constituent parts.


Business Understanding 


Business understanding is when you understand what your company 
wants and expects from the project. It is great to include key people in the 
discussions and documentation to produce a project plan. 


Data Understanding 


Data understanding involves the exploration of data that includes the use 
of statistics and data visualizations. Descriptive statistics can be used to 
summarize the data, inferential statistics can be used to test two data sets 
and samples, and regressions can be used to explore the relationships 
between multiple variables. Data visualizations use charts, graphs, and 
dashboards to understand the data. This phase allows you to understand 
the quality of data. 


Data Preparation 


Data preparation is one of the most important and time-consuming
phases. It can include selecting a sample subset, selecting variables,
imputing missing values, transforming attributes or variables (including
log transforms and feature scaling), and removing duplicates. Variable
selection can be done with a correlation matrix in a data visualization.
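The preparation steps above can be sketched in base R as follows. The raw data frame is invented for illustration, and median imputation is just one simple choice assumed here, not a method prescribed by CRISP-DM.

```r
# Hypothetical raw data with a missing value and a duplicated row
raw <- data.frame(
  income = c(1000, 2500, NA, 2500, 40000),
  age    = c(23, 35, 41, 35, 58),
  spend  = c(120, 300, 280, 300, 2100)
)

# Remove duplicate rows
prepared <- raw[!duplicated(raw), ]

# Impute the missing income with the median of the observed values
prepared$income[is.na(prepared$income)] <-
  median(prepared$income, na.rm = TRUE)

# Log transform a skewed variable
prepared$log_income <- log(prepared$income)

# Feature scaling: standardize age to mean 0 and standard deviation 1
prepared$age_scaled <- as.numeric(scale(prepared$age))

# Correlation matrix to support variable selection
print(round(cor(prepared[, c("income", "age", "spend")]), 2))
```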


Modeling


Modeling usually means the development of a prediction model to predict
a variable in data. The prediction model can be developed using regression
algorithms, statistical learning algorithms, and machine learning
algorithms like neural networks, support vector machines, naive Bayes,
multiple linear regressions, decision trees, and more. You can also build
prescriptive and descriptive models.


Evaluation 


Evaluation is the phase where you may use techniques such as ten-fold
cross-validation to evaluate the precision and recall of your model.
You may improve your model accuracy by moving back to the previous
phases to improve or prepare your data more. You may also select the most
accurate model for your requirements. You may also evaluate the model
using the business success criteria established in the beginning stage,
which is the business understanding stage.
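As a rough sketch of ten-fold cross-validation in base R, the following estimates precision and recall for a classifier. The data is simulated, and logistic regression with a 0.5 cutoff is only an assumed example model; any classifier could take its place.

```r
set.seed(123)  # reproducible simulated data and fold assignment

# Hypothetical binary classification data (simulated)
n <- 200
x <- rnorm(n)
label <- rbinom(n, size = 1, prob = plogis(1.5 * x))
dat <- data.frame(x = x, label = label)

# Assign each row to one of ten folds at random
k <- 10
fold <- sample(rep(1:k, length.out = n))

tp <- 0; fp <- 0; fn <- 0
for (i in 1:k) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]

  # Train on nine folds, predict on the held-out fold
  model <- glm(label ~ x, data = train, family = binomial)
  prob  <- predict(model, newdata = test, type = "response")
  pred  <- as.integer(prob > 0.5)

  # Accumulate confusion-matrix counts over the held-out folds
  tp <- tp + sum(pred == 1 & test$label == 1)
  fp <- fp + sum(pred == 1 & test$label == 0)
  fn <- fn + sum(pred == 0 & test$label == 1)
}

precision <- tp / (tp + fp)  # share of predicted positives that are correct
recall    <- tp / (tp + fn)  # share of actual positives that were found
print(c(precision = precision, recall = recall))
```

Every row is used for testing exactly once, so the two figures describe performance on data the model did not see during training.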


Deployment 


Deployment is the process of using new insights and knowledge to 
improve your organization or make changes to improve your organization. 
You may use your prediction model to create a data product or to produce 
a final report based on your models. 


What Is Text Mining? 


While data mining is usually used to mine out patterns from numerical 
data, text mining is used to mine out patterns from textual data like Twitter 
tweets, blog postings, and feedback. Text mining, also known as text data 
mining, is the process of deriving high quality semantics and knowledge 
from textual data. 




Text mining tasks may consist of text classification, text clustering, and
entity extraction; text analytics may include sentiment analysis, TF-IDF,
part-of-speech tagging, named entity recognition, and text link analysis.

Text mining uses the same process as the data mining CRISP-DM 
model, with slight differences as shown in Figure 1-3. 


[Diagram: Acquisition, Cleaning, Organize; Transformation, Extract
Knowledge; Discover, Presentation, Interaction; Performance and Utility
Assessment; Feedback Loop]

Figure 1-3. Text mining 


Data Acquisition 


Data acquisition is the process of gathering the textual data, combining the 
textual data, and doing some text cleaning. The business understanding 
stage may also be included here. 


Text Preprocessing 


Text preprocessing includes Porter stemming; stopword removal; conversion
to uppercase, lowercase, or proper case; extraction of words or tokens
based on named entities or regular expressions; and transformation of text
to vector or numerical forms. Text preprocessing is like the data
preparation phase in CRISP-DM.
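Several of these preprocessing steps can be sketched in base R. The documents and the short stopword list below are invented for illustration, and preprocess is a hypothetical helper. Stemming is left out because it needs an add-on package.

```r
# Hypothetical documents (illustrative text only)
docs <- c("The service was GREAT and the staff were friendly!",
          "Great food, but the service was slow.")

# A tiny assumed stopword list; real projects use a fuller list
stopwords <- c("the", "was", "and", "were", "but", "a", "an")

# Hypothetical helper: lowercase, strip punctuation, tokenize, drop stopwords
preprocess <- function(text) {
  text   <- tolower(text)
  text   <- gsub("[^a-z ]", " ", text)
  tokens <- strsplit(text, " +")[[1]]
  tokens <- tokens[tokens != ""]
  tokens[!tokens %in% stopwords]
}

tokens_per_doc <- lapply(docs, preprocess)

# Transform the text to numerical form: a document-term matrix with
# one row per document and one column per term
vocabulary <- sort(unique(unlist(tokens_per_doc)))
dtm <- t(vapply(tokens_per_doc, function(tokens) {
  as.integer(table(factor(tokens, levels = vocabulary)))
}, integer(length(vocabulary))))
colnames(dtm) <- vocabulary
print(dtm)
```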


Modeling


Text analytics or text discovery is the use of part-of-speech tagging or
named entity recognition to understand each document. It implements
sentiment analysis to understand the sentiments of the documents and
text link analysis to summarize all documents in text links. Some books
refer to text analytics as text mining; I think text analytics is similar to
data understanding.

Modeling can also be the process of creating prediction models such 
as text classification. Some books may put the data mining process in this 
stage to create prediction models, descriptive models, and prescriptive 
models, after converting the text to vectors in the text preprocessing stage. 


Evaluation/Validation 


Evaluation or validation is the process of evaluating the accuracy of the
model created. You can view this as the evaluation stage of the CRISP-DM
model.


Applications 


The applications stage is the deployment stage in the CRISP-DM model, 
where presentations or a full report are developed. You may also develop 
the model into a recommendation and classification system as a data 
product. 


Natural Language Processing 


Natural language processing (NLP) is an area of machine learning
and computer science used to assist the computer in processing and
understanding natural language. NLP can include part-of-speech tagging,
parsing, Porter stemming, named entity recognition, optical character
recognition, sentiment analysis, speech recognition, and more. NLP works
hand in hand with text analytics and text mining.

The history of NLP started in the 1950s when Alan Turing published
an article called "Computing Machinery and Intelligence." Some notable
natural language processing software was developed in the 1960s, such as 
ELIZA, which provided human-like interactions. In the 1970s, software was 
developed to write ontologies. In the 1980s, developers introduced Markov 
models and initiated research on statistical models to do POS tagging. 
Recent research has concentrated on supervised and semi-supervised 
algorithms and deep learning algorithms. 


Three Types of Analytics 


Selecting the type of analytics can be difficult and challenging; luckily, 
analytics can be categorized into descriptive analytics, predictive analytics, 
and prescriptive analytics. No analytic type is better than the others, but 
they can be combined with each other. 


• Descriptive Analytics: Uses data analytics to know
what happened.


• Predictive Analytics: Uses statistical learning and
machine learning to predict the future.


• Prescriptive Analytics: Uses simulation algorithms to
know what should be done.


Descriptive Analytics 


Descriptive analytics uses statistics to understand the data: descriptive
statistics to summarize the data, inferential statistics to test two data sets
or samples, and regression analysis to study the relationships between
multiple variables.


Predictive Analytics


Predictive analytics predicts a variable by implementing machine learning
and statistical learning algorithms. In statistics, regressions can be used to
predict a variable. For example, take y = mx + c. You can determine m and c
by training a linear regression model using historical data. Here, y is the
variable to predict and x is the input variable. If you put in an x value,
you can predict y.
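To illustrate the prediction step, this sketch trains a model with lm() and then calls predict() on new x values. The historical data is simulated from an assumed relationship of y = 2x + 1 plus noise, so the numbers carry no real-world meaning.

```r
# Hypothetical historical data, simulated from y = 2x + 1 plus noise
set.seed(1)
x <- 1:20
y <- 2 * x + 1 + rnorm(20, sd = 0.5)
history <- data.frame(x = x, y = y)

# Train the linear regression model on the historical data
model <- lm(y ~ x, data = history)

# Put in new x values to predict y
new_data <- data.frame(x = c(21, 25))
predicted <- unname(predict(model, newdata = new_data))
print(predicted)
```

The column name in new_data must match the input variable name used in the model formula, otherwise predict() cannot find the input.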


Prescriptive Analytics 


This is a field that allows a user to find the input values needed to get a
certain outcome. In simple form, this kind of analytics is used to provide
advice. For example, take y = mx + c. You have the m and c values. You want
a certain y outcome, so what value should you put into x? To get that x
value, what does your company need to do, or what kind of advice do you
need to give to the company? If you have a multiple linear regression, there
are many x variables, so you need some simulation or evolutionary search
algorithm to get the x values.
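A small sketch of this idea follows. The coefficients and target outcome are assumed values, not estimates from real data. For the case with several inputs, the built-in optim() function stands in for the simulation or evolutionary search described in the text; it is a general numerical optimizer, not an evolutionary algorithm.

```r
# Assumed, already-estimated coefficients of y = m * x + c
m  <- 2
c0 <- 1         # named c0 to avoid confusion with R's c() function
target_y <- 21  # the outcome you want

# One input: rearrange the formula to x = (y - c) / m
x_needed <- (target_y - c0) / m
print(x_needed)

# Two inputs: assumed model y = 1 + 2 * x1 + 3 * x2. Many (x1, x2) pairs
# reach the target, so a numerical search picks one of them.
predict_y <- function(x) 1 + 2 * x[1] + 3 * x[2]
loss <- function(x) (predict_y(x) - target_y)^2
result <- optim(par = c(0, 0), fn = loss, method = "BFGS")

print(result$par)             # one combination of inputs
print(predict_y(result$par))  # the outcome it produces
```

In practice the inputs usually have limits, such as budgets or capacities, which would be added as constraints to the search.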


Big Data 


Big data refers to data sets that are too big and complex for a traditional
computer to process. The challenges of big data may include capturing data,
data storage, data analysis, and data visualization. There are three
properties or characteristics of big data.


Volume 


People are now more connected, so there are many more data sources,
and as a consequence, the amount of data has increased exponentially. The
increase of data requires more computing power to process and analyze it.
Traditional computing power is not able to process and analyze this data.


Velocity


The speed of data is increasing and the speed of data coming in is so fast
that it is very difficult to process and analyze the data. Traditional
computing methods can't process and analyze at the speed of data coming in.


Variety 


More sources means more data in different formats and types, such as 
images, videos, voice, speech, textual data, and numerical data, both 
unstructured and structured. Various data formats require different 
methods to extract the data from them. This means that the data is difficult 
to process and analyze, and traditional computing methods can't process 
such data. 

Data grows very quickly, due to IoT devices like mobile devices, 
wireless sensor networks, and RFID readers. Based on an IDC report, 
global data will increase from 4.4 zettabytes to 44 zettabytes from 2013 to 
2020. 

Relational databases and desktop statistics and data science software
struggle to process and analyze big data. Hence, big data requires
parallel and distributed systems like Hadoop and Apache Spark to process
and analyze the data.

Two popular systems or frameworks for big data are Apache Spark
and Hadoop. Hadoop is a distributed data system that stores big data across
different clusters and computers. One cluster can have many computers. The
Hadoop storage system is known as the Hadoop Distributed File System
(HDFS). The Hadoop ecosystem has many tools, such as Mahout for machine
learning processing. Hadoop also has processing systems, such as MapReduce.

Apache Spark is a data processing system for processing distributed data.
Apache Spark does not have a file storage system, so it
needs to integrate into a system like Hadoop. Apache Spark is a lot faster
than Hadoop MapReduce for many workloads and supports full data analytics,
data processing, and data prediction. R, Python, and Java can interface
with these Hadoop and Apache Spark systems.


Why R? 


When learning data science, many people struggle with choosing which
programming languages and data science software to learn. There are many
programming languages available for data science, like R, Python, SAS,
Java, and more. There are many data science software packages to learn,
such as SPSS Statistics, SPSS Modeler, SAS Enterprise Miner, Tableau,
RapidMiner, Weka, GATE, and more.

I recommend learning R for statistics because it was developed for
statistics in the first place. Python is a general-purpose programming
language, so you can develop full applications and software via Python
programming. Hence, if you want to develop a data product or data
application, Python can be a better choice. R programming is very strong in
statistics, so it is ideal for data exploration or data understanding using
descriptive statistics, inferential statistics, regression analysis, and data
visualizations. R is also ideal for modeling because you can use statistical
learning like regressions for predictive analytics. R also has some packages
for data mining, text mining, and machine learning like rattle, caret, and
tm. R programming can also interface with big data systems like Apache
Spark using sparklyr. SAS programming is commercial, and Java has direct
interfaces with GATE, Stanford NLP, and Weka. SPSS Statistics, SPSS
Modeler, SAS Enterprise Miner, and Tableau are data science software
packages with GUIs and are commercial. RapidMiner, Weka, and GATE are
open source software packages for data science.

R is also heavily used in many of the companies that hire data
scientists. Google and Facebook have data scientists who use R. R is also
used in companies like Bank of America, Ford, Uber, Trulia, and more.

R is also heavily used in academia, and R is very popular among
academic researchers, who can use R graphics for publications.

Scripts written in R can be used on different operating systems,
including Linux, macOS, and Windows, as long as the R interpreter is
installed. This is not possible with languages like C#.


Conclusion 


In this chapter, you looked into R programming. You now know that R 
programming is a programming language for statistical computing and is 
supported by the R Foundation for Statistical Computing. The R language 
is used by researchers for statistical analysis, and R's popularity has
increased every year. 

I also discussed high-level programming languages and low-level
programming languages. HLLs are designed to be used by humans and are
closer to human language. LLLs, on the other hand, are a lot closer to
machine language. LLLs can be executed on a computer with little or no
conversion between languages, so they can be faster.

I also discussed statistics. Statistics is a collection of mathematical
methods that deal with the organization, analysis, and interpretation of
data. There are three main statistical methods used in data analysis:
descriptive statistics, inferential statistics, and regression analysis.

I also discussed data science. Data science is a multidisciplinary field
that includes statistics, computer science, machine learning, and domain
expertise to get knowledge and insights from data. Data science usually
ends up with the development of a data product. A data product turns the
data of a company into a product that solves a problem.

Data mining is closely related to data science. Data mining is the
process of identifying patterns from data using statistics, machine learning,
and data warehouses or databases. There are many process models for data
mining; CRISP-DM is one of the most popular. In CRISP-DM, data
mining comprises business understanding, data understanding, data
preparation, modeling, evaluation, and deployment.

While data mining is usually used to mine out patterns from numerical
data, text mining is used to mine out patterns from textual data like Twitter
tweets, blog postings, and feedback. Text mining, also known as text data
mining, is the process of deriving high quality semantics and knowledge
from textual data. Text mining consists of text classification, text clustering,
and entity extraction; text analytics may include sentiment analysis,
TF-IDF, part-of-speech tagging, named entity recognition, and text link
analysis. Text mining uses the same process as the data mining CRISP-DM
model, with slight differences.

Natural language processing is an area of machine learning and
computer science used to assist the computer in processing and
understanding natural language. NLP can include part-of-speech tagging,
parsing, Porter stemming, named entity recognition, optical character
recognition, sentiment analysis, speech recognition, and more. NLP works
hand in hand with text analytics and text mining.

Selecting the types of analytics can be difficult and challenging. 
Luckily, analytics can be categorized into descriptive analytics, predictive 
analytics, and prescriptive analytics. No one type of analytics is better than 
the others, but they can be combined with each other. 

Big data refers to data sets that are too big and complex for a traditional
computer to process. The challenges of big data may include capturing data,
data storage, data analysis, and data visualization. There are three
properties of big data: volume, velocity, and variety. There are two popular
systems or frameworks for big data: Hadoop and Apache Spark.

When learning data science, there are many programming languages,
like R, Python, SAS, and Java. There are many data science software
packages, such as SPSS Statistics, SPSS Modeler, SAS Enterprise Miner,
RapidMiner, and Weka. R was developed with statistics in mind, so it is
best for the statistics portion of data mining, such as data understanding,
modeling with statistical learning algorithms, and data visualizations.

R has packages for machine learning, natural language processing, and
text mining, and can interface with Apache Spark for big data. Python is a
full programming language, and it is best for developing data products or
software. The SAS programming language is commercial and not free. R has
become very popular, according to the TIOBE ranking, and many companies like
Facebook and Google have data scientists who use R. R is also very popular
with academic researchers. R scripts or code can be run on different
operating systems as long as the R interpreter is installed.


CHAPTER 2


Getting Started 


R programming is a programming language with object-oriented features 
ideal for statistical computing and data visualizations. R programming 
can do descriptive statistics, inferential statistics, and regression analysis. 
R programming is a GNU package and is a command line application. 
RStudio is an integrated development environment (IDE) for R 
programming. An IDE offers features to help you write code more easily 
and more productively by providing a code editor, compiler, and debugger. 
The code editor usually has syntax highlighting and intelligent code 
completion. 

In this chapter, you will explore the R programming command line 
application and the RStudio IDE, and you will install R and RStudio on 
your computer. You will look into what an IDE is and you will explore the 
RStudio interface. RStudio and R can read a .csv file easily, perform some 
descriptive statistics, and plot simple graphs. 
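As a preview, these three tasks can be sketched in a few lines of base R. The file and its height and weight columns are invented for this example, and it is written to a temporary location only so that the sketch is self-contained.

```r
# Write a small hypothetical data set to a temporary .csv file;
# in practice you would point read.csv() at your own file
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(height = c(150, 160, 170, 180),
                     weight = c(50, 58, 69, 80)),
          csv_path, row.names = FALSE)

# Read the .csv file into a data frame
people <- read.csv(csv_path)

# Descriptive statistics for every column
print(summary(people))

# Plot a simple scatter plot into a PDF file; in RStudio, calling
# plot() without pdf() shows the graph in the Plots pane instead
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path)
plot(people$height, people$weight,
     xlab = "Height", ylab = "Weight", main = "Height vs. weight")
invisible(dev.off())
```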


What Is R? 


R programming is for statistical computing and is supported by the R
Foundation for Statistical Computing. R programming is used by many
academics and researchers for data and statistical analysis, and the
popularity of R has risen over time.

R is a GNU package and is available under the GNU General Public
License, which means it is open source and free to use as long as you
adhere to the license. R is available in a command line application, as
shown in Figure 2-1.

Figure 2-1. The RGui interface


R programming is an implementation of the S programming language.
Its libraries consist of statistical and data visualization techniques, and it
can conduct descriptive statistics, inferential statistics, and regression
analysis. You will explore the differences between the R programming
command line application and the RStudio IDE, as well as the basics of the
descriptive statistics features and the data visualization features.


The Integrated Development Environment 


An IDE is a software application that helps programmers develop 
software more easily and more productively. An IDE is made up of a code 
editor, compiler, and debugger tools. Code editors usually offer syntax 
highlighting and intelligent code completion. 
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Some IDEs, like NetBeans, also have an interpreter and others, like 
SharpDevelop, don’t. Some IDEs have a version control system and tools
like a graphical user interface (GUI) builder, and many IDEs have class and 
object browsers. 

IDEs are developed to increase the productivity of the developer 
by combining features like a code editor, compiler, debugger, and 
interpreter. This is different from a programming text editor like
vi or Notepad++, which offers syntax highlighting but usually doesn’t
communicate with the debugger and compiler.

The beginning of IDEs can be traced back to when punched cards were 
submitted to the compiler in early systems. Dartmouth BASIC was the 
first programming language to be created with an IDE. Maestro I was later 
created by Softlab Munich and can be considered the first full IDE of the
1970s and 1980s. Maestro I can be found in the Museum of Information
Technology at Arlington, Virginia. The Softbench IDE was later created 
to have plugins. Today, Visual Studio, NetBeans, and Eclipse are the most 
famous IDEs. The R programming IDE is RStudio, and Figure 2-2 shows its 
intelligent code completion. 
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Figure 2-2. RStudio IDE intelligent code completion 
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RStudio: The IDE for R 


RStudio is the most popular IDE for R programming. RStudio has a code
editor that offers syntax highlighting and intelligent code completion.
RStudio also has a workspace showing all the variables and
history. You may double-click a variable to view it as a table or with
other options.

The R console is in RStudio so you can view the results of the R scripts 
after running the scripts; you can also type into the R console with R code 
to do some simple computing. The Plots and Others portion is available in 
RStudio to let you view the charts and graphs plotted from R scripts. The 
Plots and Others portion allows you to easily save the graphs and charts. 
Figure 2-3 shows the RStudio IDE interface. 
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Figure 2-3. RStudio IDE interface 


Installation of R and RStudio 


To write and run R scripts, you must install the R programming command
line application. You can download it
from www.r-project.org/, as seen in Figure 2-4.
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Figure 2-4. The R project website 


In this book, you will download R for Windows. You can also download 
for Linux and Mac OS, as seen in Figure 2-5. 
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Figure 2-5. Downloading the R base for different OS options 


To install the software, double-click the downloaded setup file and follow
the instructions of the installer to install the R programming command
line application, as seen in Figure 2-6.
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Figure 2-6. Installation of R 


After the R programming command line application is installed, you
can start it, as seen in Figure 2-7.


R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>



Figure 2-7. The RGui interface

You can create your own Hello World application by using the print()
function. The Hello World application is the standard first application to
be developed when learning a programming language. Type the following
code into the RGui:


print("Hello World"); 


The print() function is used to print some text on the console screen.
You may print any text other than the "Hello World" shown in Figure 2-8.

> print("Hello World");
[1] "Hello World"


Figure 2-8. The R "Hello World" application


RStudio is the most popular IDE for the R programming language.
RStudio helps you write R programming code more easily and more
productively. To download and install RStudio, visit www.rstudio.com/, as
seen in Figure 2-9.




CHAPTER 2 GETTING STARTED 



Figure 2-9. The RStudio IDE website 


Download the latest version. For this book, you will download the 64-bit 
Windows version. After downloading the RStudio installer or setup file, 
double-click the file to install the RStudio IDE, as seen in Figure 2-10. 
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Figure 2-10. Installation of the RStudio IDE 
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After installing the RStudio IDE, you can run the RStudio IDE software, 
as seen in Figure 2-11. 
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Figure 2-11. The RStudio IDE interface 


Before running the script, you need to select the R programming
command line application version to use. Click Tools > Global Options,
as seen in Figure 2-12.


Figure 2-12. The RStudio IDE's Tools menu
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Click the Change button to select the R version, as seen in Figure 2-13. 




Figure 2-13. RStudio IDE options 


For the beginner, choose the R version shown in Figure 2-14. If you
want to change the R version in the future, you can use this method to
do so.


Figure 2-14. The Choose R Installation dialog 


After choosing the R version and clicking OK, you must restart the
RStudio IDE, as depicted in Figure 2-15.




Figure 2-15. RStudio IDE R version changing

After restarting RStudio, the Console tab should show the selected R
version, as seen in Figure 2-16.


R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

[Workspace loaded from ~/.RData]

>



Figure 2-16. Changed R version 


Writing Scripts in R and RStudio 


You can read a comma-separated values (CSV) file using the read.csv() 
function in the R programming language, as seen in Figure 2-17. 


myData <- read.csv(file="D:/data.csv", header=TRUE, sep=",");
myData; 



Figure 2-17. RGui: Reading a CSV 


In the R programming language, you can use the summary() function
to get the basic descriptive statistics for all the variables, as shown in
Figure 2-18. I will discuss descriptive statistics in a later chapter of this book.


summary(myData);



Figure 2-18. RGui: The summary() function 
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In the R programming language, you can plot a scatterplot using the 
plot() function, as seen in Figure 2-19. 


plot(myData$x, myData$x2); 



Figure 2-19. RGui: Plotting a chart 


RStudio is an IDE that provides a GUI for the R programming
command line application. RStudio provides word suggestions and syntax
highlighting for the R programming language. The RStudio IDE for the R
programming language is seen in Figure 2-20.


Figure 2-20. The RStudio IDE interface 


With RStudio, you can write all the code into the code editor and run 
the script, as seen in Figures 2-22, 2-23, and 2-24. 


myData <- read.csv(file="D:/data.csv", header=TRUE, sep=",");
myData; 

summary(myData); 

plot(myData$x, myData$x2); 
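
If you do not have a file at D:/data.csv, the same steps can be tried on a small temporary file. The following is only a sketch under stated assumptions: the column names x, x2, and x3 mirror the figures, but the values are invented for illustration, and the names sampleData and csvPath are hypothetical.

```r
# Hypothetical data; the values are invented for illustration only.
sampleData <- data.frame(
  x  = c(1.2, 0.4, 2.3, 1.8, 0.9),
  x2 = c(2.5, 1.1, 3.9, 3.0, 2.2),
  x3 = c(0.3, 4.1, 2.2, 1.7, 3.5)
)

# Write the data to a temporary CSV file so the script works on any machine.
csvPath <- tempfile(fileext = ".csv")
write.csv(sampleData, file = csvPath, row.names = FALSE)

# Read the file back, as in the chapter's script.
myData <- read.csv(file = csvPath, header = TRUE, sep = ",")

# Basic descriptive statistics for every column.
print(summary(myData))

# Scatterplot of x against x2.
plot(myData$x, myData$x2)
```

Replacing csvPath with the path to your own file should give the same kind of output for your data.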


As you type the code, RStudio shows the intelligent code completion, 
as shown in Figure 2-21. 



Figure 2-21. RStudio IDE intelligent code completion

To run the script, select all the R code in the code editor and click Run
or press Ctrl+Enter (Figure 2-22).


Figure 2-22. RStudio IDE: Running a script 


Or you can click Code > Run Region > Run All (Figure 2-23). 



Figure 2-23. RStudio IDE: Running a script



The results are shown in Figure 2-24. 
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Figure 2-24. RStudio IDE: Results after running the R script 


The RStudio IDE offers syntax highlighting features in the code editor.
When you run the R script, you can view the results in the Console tab and
see the scatterplot in the Plots tab. By double-clicking myData in the Global
Environment tab, you can view the data loaded from the .csv file, as shown
in Figure 2-25.
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Figure 2-25. RStudio IDE: Viewing the loaded data 


Conclusion 


In this chapter, you looked at R programming. You now understand what
R is: a programming language for statistical computing that
is supported by the R Foundation for Statistical Computing. R
is used by researchers for statistical analysis, and its
popularity has risen over time.

You also explored RStudio. RStudio is the IDE for the R programming
language, and it has syntax highlighting and intelligent code completion to
assist you in writing R scripts more easily and more productively. You also
looked at how R and RStudio can read a .csv file and perform descriptive
statistics and data visualizations, and you explored the differences between
the two applications.

You also installed R and the RStudio IDE, and you saw how to
select which version of the R programming command line
application the RStudio IDE uses.




You also learned that an IDE is software to help you write code more
easily and more productively. IDEs usually offer syntax highlighting and
intelligent code completion and have a code editor, a compiler, and a
debugger.

For R programming, RStudio is the most popular IDE. RStudio has
a code editor that offers syntax highlighting and intelligent code
completion. RStudio also has a workspace showing all the variables and
history. You may double-click the variables to view them using tables and
more. The R console is in RStudio so you can view the results of R scripts.
The Plots and Others portion is available in RStudio to let you view the
charts and graphs plotted from the R code.
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Basic Syntax 


You will use R for applied statistics, which can be used in the data 
understanding and modeling stages of the CRISP-DM data mining 
model. R programming is a programming language with object-oriented 
programming features. R programming was created for statistics and is 
used in the academic and research fields. However, before you go into 
statistics, you need to learn to program R scripts. 

In this chapter, you will explore the syntax of R programming. I will 
discuss the R console and code editor in RStudio, as well as R objects 
and the data structure of R programming, from variables and data types 
to lists, vectors, matrices, and data frames. I will also discuss conditional 
statements, loops, and functions. Then you will create a simple calculator 
after learning the basics. 


Writing in R Console 


As you saw in Chapter 2, the R console offers a fast and easy way to do 
statistical calculations and some data visualizations. The R console is also 
like a calculator, so you can always use the R console to calculate some 
math equations. 

To do math calculations, you can just type in some math equations like


1+1
> 1+1
[1] 2

1-3
> 1-3
[1] -2

1*5
> 1*5
[1] 5

1/6
> 1/6
[1] 0.1666667

tan(2)
> tan(2)
[1] -2.18504
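
The negative result of tan(2) can be surprising at first. R's trigonometric functions take angles in radians, not degrees. The following sketch shows one way to convert; the variable names angleDegrees and angleRadians are chosen here only for illustration.

```r
# Trigonometric functions in R take angles in radians.
angleDegrees <- 45
angleRadians <- angleDegrees * pi / 180

# Tangent of 45 degrees.
print(tan(angleRadians))

# 2 radians is about 114.6 degrees, so the tangent is negative.
print(2 * 180 / pi)
print(tan(2))
```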


To do some simple statistical calculations, you can do the following:

Standard deviation


sd(c(1, 2, 3, 4, 5, 6))
> sd(c(1, 2, 3, 4, 5, 6))
[1] 1.870829


Mean


mean(c(1, 2, 3, 4, 5, 6))
> mean(c(1, 2, 3, 4, 5, 6))
[1] 3.5

Min


min(c(1, 2, 3, 4, 5, 6))
> min(c(1, 2, 3, 4, 5, 6))
[1] 1
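
To see where the standard deviation value comes from, the calculation can be written out step by step. The sketch below assumes the usual sample formula, which divides by n - 1, and the names values, deviations, and manualSd are chosen only for this illustration.

```r
values <- c(1, 2, 3, 4, 5, 6)

# Sample standard deviation computed step by step.
n <- length(values)
deviations <- values - mean(values)
manualSd <- sqrt(sum(deviations^2) / (n - 1))

print(manualSd)

# The built-in function gives the same value.
print(sd(values))
```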
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To plot charts or graphs, type 


plot(c(1, 2, 3, 4, 5, 6), c(2, 3, 4, 5, 6, 7))
> plot(c(1, 2, 3, 4, 5, 6), c(2, 3, 4, 5, 6, 7))


which is shown in Figure 3-1.



Figure 3-1. Scatter plot 


To sum up, the R console, despite being basic, offers the following
advantages:

• High performance

• Fast prototyping and testing of your ideas and logic
before proceeding further, such as when developing
Windows Form applications

• Personally, I use the R console application to test
algorithms and other code fragments when in the
process of developing complex R scripts.


Using the Code Editor 


The RStudio IDE offers features like a code editor, debugger, and compiler
that communicate with the R command line application or R console. The
R code editor offers features like intelligent code completion and syntax
highlighting, shown in Figures 3-2 and 3-3, respectively.




Figure 3-2. Example of intelligent code completion 




Figure 3-3. Example of syntax highlighting 


To create a new script in RStudio, click
File > New File > R Script, as shown in Figure 3-4.




Figure 3-4. Creating a new script 


You can then code your R Script. For now, type in the following code, 
shown in Figure 3-5: 


A <- 1;
B <- 2;

A/B;
A * B;
A + B;
A - B;

A^2;
B^2;



Figure 3-5. Code in a script 


To run the R script, highlight the code in the code editor and click Run, 
as shown in Figure 3-6. 




Figure 3-6. Running the script

To view the results of the R script, look in the R console of RStudio, as
shown in Figure 3-7.


Figure 3-7. RStudio IDE console results 


You can also see that in the Environment tab, there are two variables,
as shown in Figure 3-8.




Figure 3-8. RStudio IDE Environment tab 
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Adding Comments to the Code 


You can add comments to the code. Comments are text that will not be
run by the R console. You add a comment by putting # in front of
the text. Comments let you describe your code so that anyone can read it
more easily.


#Create variable A with value 1 
A <- 1;


#Create variable B with value 2 
B <- 2; 


#Calculate A divide B 
A/B; 


#Calculate A times B 
A * B;


#Calculate A plus B 
A + B; 


#Calculate A subtract B 
A - B; 


#Calculate A to power of 2 
A^2; 


#Calculate B to power of 2 
B^2; 


You can rerun the code and you should get the result shown in
Figure 3-9.

> #Create variable A with value 1
> A <- 1
> #Create variable B with value 2
> B <- 2
> #Calculate A divide B
> A/B
[1] 0.5
> #Calculate A times B
> A * B
[1] 2
> #Calculate A plus B
> A + B
[1] 3
> #Calculate A subtract B
> A - B
[1] -1
> #Calculate A to power of 2
> A^2
[1] 1
> #Calculate B to power of 2
> B^2
[1] 4



Figure 3-9. RStudio IDE reruns the code with comments 


Variables 


Let's look into the code and scripts you used previously. You actually
created two variables, A and B, and assigned some values to the two
variables.

A <- 1 
B <- 2 


In this code, A is a variable, and B is also a variable. <- means assign. A
<- 1 means variable A is assigned a value of 1, and 1 is a numeric type. B <- 2
means variable B is assigned a value of 2, and 2 is also a numeric type.

If you want to assign text or character values, you add quotation marks, like


A <- "Hello World"
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Variable A is assigned a text value of "Hello World". Character and 
numeric are data types. 


Data Types 


A data type is the kind of information or data that a variable is
holding. A data type can be, for example, numeric or character.
For example,


A <- "abc"
B <- 1.2


In R, data types are automatically determined. Because of the 
quotations surrounding the values, variable A is of the character data type, 
while variable B is of the numeric data type. 


R is also capable of storing other data types, as shown in Table 3-1.


Table 3-1. Data Types

Data Type    Values
Logical      TRUE, FALSE
Numeric      12.3, 2.55, 1.0
Character    “a”, “abc”, “this is a bat”


For more information, please see www.tutorialspoint.com/r/r_data_types.htm.

You can also determine the data type of a variable by using the class()
function. For example:


A <- "ABC";
print(class(A));


> A <- "ABC";
> print(class(A));
[1] "character"


A <- 123;
print(class(A));


> A <- 123;
> print(class(A));
[1] "numeric"


A <- TRUE;
print(class(A));


> A <- TRUE;
> print(class(A));
[1] "logical"


Why is the data type important? If you do a math calculation in R and
one variable's data type is numeric and the other variable's data type is
non-numeric, you will get the following error:


> A <- 123;
> B <- "aaa";
> A + B;
Error in A + B : non-numeric argument to binary operator
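
In a longer script, an error like this stops execution. One way to handle it, sketched here with base R's tryCatch() function, is to catch the error and continue with a fallback value. Returning NA as the fallback is an assumption made for this illustration, not something the chapter prescribes.

```r
A <- 123
B <- "aaa"

# tryCatch() evaluates the expression and hands any error to the handler
# instead of stopping the script.
result <- tryCatch(
  A + B,
  error = function(e) {
    message("Could not add A and B: ", conditionMessage(e))
    NA
  }
)

print(result)
```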
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You can also use is.datatype() to determine whether a variable is of a 
certain data type: 


> A <- 123; 

> print(is.numeric(A)); 
[1] TRUE 

> print(is.character(A));
[1] FALSE 


You can also use as.datatype() to convert between data types: 


> A <- 12;
> B <- "56";
> A + B;
Error in A + B : non-numeric argument to binary operator
> B <- as.numeric(B);
> A + B;
[1] 68


A <- 12 means that A is a numeric data type. B <- "56" means that B
is a character data type. When you add A and B together, you get an error
because you are adding a numeric data type to a character data type.

If you convert B to the numeric data type using B <-
as.numeric(B), you can add A and B together because A and B are now both
of the numeric data type.
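
Conversion only works when the text actually looks like a number. The sketch below, using the hypothetical variables C and converted, shows what as.numeric() returns otherwise: a missing value (NA) together with a warning, which is suppressed here to keep the output short.

```r
B <- "56"
C <- "fifty-six"

# Text that looks like a number converts cleanly.
print(as.numeric(B))

# Text that does not look like a number becomes NA, with a warning.
converted <- suppressWarnings(as.numeric(C))
print(converted)
print(is.na(converted))
```

Checking the result with is.na() before doing arithmetic is one simple way to catch values that failed to convert.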


Vectors 


A vector is a basic data structure or R object for storing a set of values of the
same data type. A vector is the most basic and common data structure in
R. A vector is used when you want to store and modify a set of values. The
data types can be logical, integer, double, and character. The integer data
type is used to store number values without a decimal, and the double data
type is used to store number values with a decimal. Vectors can be created
using the c() function as follows:


variable <- c(..., ..., ...)

> A <- c(1, 2, 3, 4, 5, 6);
> print(A);
[1] 1 2 3 4 5 6


You can check the data type of the vector using typeof() and class(): 


> typeof(A);
[1] "double"
> class(A);
[1] "numeric"
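
The difference between the integer and double types, and the rule that a vector holds a single data type, can be checked directly. This is a small sketch with illustrative variable names; it relies on the L suffix and the : operator, both of which produce integers in R.

```r
A <- c(1, 2, 3)      # plain numbers are stored as double
B <- 1:3             # the : operator produces integers
C <- c(1L, 2L, 3L)   # the L suffix also marks integers

print(typeof(A))
print(typeof(B))
print(typeof(C))

# Mixing types in c() coerces every element to one common type.
D <- c(1, "a", TRUE)
print(D)
print(class(D))
```

Because D contains a character value, the number and the logical value are converted to text as well.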


You can check the number of elements or values in a vector using the 
length() function: 


> A <- c(1, 2, 3, 4, 5, 6);
> print(A);
[1] 1 2 3 4 5 6
> length(A);
[1] 6


You can also use the operator : to create a vector: 


> A <- 1:8; 
> print(A); 
[1] 1 2 3 4 5 6 7 8


To retrieve the second element or value of a vector, use the [ ] brackets
and put in the element number to retrieve:

> A <- 1:8;
> print(A);
[1] 1 2 3 4 5 6 7 8
> A[2];
[1] 2

You can also retrieve the elements in the vector using another vector,
for example, to retrieve the second and fifth element:


> A <- 1:8;
> print(A);
[1] 1 2 3 4 5 6 7 8
> A[c(2, 5)];
[1] 2 5


To retrieve all elements except the second element, do this: 


> A <- 1:8;
> print(A);
[1] 1 2 3 4 5 6 7 8
> A[-2];
[1] 1 3 4 5 6 7 8

You can also retrieve elements of a vector using a logical vector: 


> A <- 1:8;
> print(A);
[1] 1 2 3 4 5 6 7 8
> A[c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)];
[1] 2 4 6 8

You can also use greater-than or less-than signs to retrieve elements:


> A <- 1:8;
> print(A);
[1] 1 2 3 4 5 6 7 8
> A[A > 5];
[1] 6 7 8

You can modify a vector as follows using the assignment operator, <-:


> A <- 1:8;
> print(A);
[1] 1 2 3 4 5 6 7 8
> A[3] <- 9;
> print(A);
[1] 1 2 9 4 5 6 7 8
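
Retrieving elements with a comparison such as A[A > 5] happens in two steps, which the following sketch makes explicit. The name isLarge is hypothetical and used only here.

```r
A <- 1:8

# The comparison is evaluated first and gives one TRUE or FALSE per element.
isLarge <- A > 5
print(isLarge)

# Using that logical vector as an index keeps only the TRUE positions.
print(A[isLarge])

# Conditions can be combined with & (and) and | (or).
print(A[A > 2 & A < 6])
```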


Lists 


A list is like a vector. It is an R object that can store a set of values or 
elements, but a list can store values of different data types. A list is also 
another common data structure in R. You use a list when you want to 
modify and store a set of values of different data types. A vector can only 
store values of the same data type. The syntax to create a list is as follows: 


variable <- list(..., ..., ...)

To create a list, do this:


> A <- list("a", "b", 1, 2);
> print(A);
[[1]]
[1] "a"

[[2]]
[1] "b"

[[3]]
[1] 1

[[4]]
[1] 2
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To see the element data type or the data structure type of the list, you 
can use the str() and typeof () functions: 


> str(A); 
List of 4 

$ : chr "a" 
$ : chr "b" 
$ : num 1 

$ : num 2 
> typeof(A);
[1] "list" 


You can get the length or number of elements in the list by using the 
length() function: 


> A <- list("a", "b", 1, 2); 
> length(A); 
[1] 4 


You can retrieve the values in the list using an integer:

You can retrieve the values using a negative integer:

[[4]]
[1] 2

> A[c(TRUE, FALSE, FALSE, FALSE)];
[[1]]
[1] "a"


When you use only [ ] to retrieve a value, it will give you the sublist. If
you want to get the content in the sublist, you need to use [[ ]].
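
The difference between single and double brackets is easier to see side by side. The following is a minimal sketch using the same four-element list; the names sublist and content are chosen only for this illustration.

```r
A <- list("a", "b", 1, 2)

sublist <- A[1]     # single brackets: a list of length 1
content <- A[[1]]   # double brackets: the element itself

print(class(sublist))
print(class(content))

# A negative index drops an element and returns the rest as a list.
print(length(A[-1]))
```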


You can modify a value or element in the list using 


> ET 


[[2]] «- "n^; 
To delete an element or value in a list, assign NULL to it:


> A[[2]] <- NULL;
> print(A);
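
The following minimal sketch traces what modifying and deleting do to the length of the list. The variable names after_modify and after_delete are illustrative. The last line shows one way to store NULL in a slot without removing the slot, which is a separate operation from deleting.

```r
A <- list("a", "b", 1, 2)

A[[2]] <- "n"               # replace the second element
after_modify <- A[[2]]

A[[2]] <- NULL              # delete the second element; later elements shift down
after_delete <- length(A)

A[2] <- list(NULL)          # keep the slot, but store NULL in it

print(after_modify)
print(after_delete)
print(length(A))
```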
Matrix


A matrix is like a vector, but it has two dimensions: rows and columns.
Like a vector, a matrix can only store values of the same data type. You
usually use a matrix to store and modify values from a data set, and it is
a good choice when you plan to do linear algebra or other mathematical
operations. For a data set with columns of different types, you need to
use a data frame.

To create a matrix, you can use the following syntax: 


variable <- matrix(vector, nrow=n, ncol=i)

> A <- matrix(c(1, 2, 3, 4, 6, 7, 8, 9, 0), nrow=3, ncol=3);
> print(A);
     [,1] [,2] [,3]
[1,]    1    4    8
[2,]    2    6    9
[3,]    3    7    0
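
Note that the output above is filled column by column. As a hedged sketch, the byrow argument of matrix() changes the fill order; the names values, by_col, and by_row are illustrative.

```r
values <- c(1, 2, 3, 4, 6, 7, 8, 9, 0)

by_col <- matrix(values, nrow = 3, ncol = 3)                # default: fill columns first
by_row <- matrix(values, nrow = 3, ncol = 3, byrow = TRUE)  # fill rows first

print(by_col)
print(by_row)
```

Filling by row produces the transpose of the column-filled matrix when the matrix is square, as it is here.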


You can use dimnames to rename the rows and columns: 


> A <- matrix(c(1, 2, 3, 4, 6, 7, 8, 9, 0), nrow=3, ncol=3);
> print(A);
     [,1] [,2] [,3]
[1,]    1    4    8
[2,]    2    6    9
[3,]    3    7    0
> A <- matrix(c(1, 2, 3, 4, 6, 7, 8, 9, 0), nrow=3, ncol=3,
dimnames=list(c("X", "Y", "Z"), c("A", "S", "D")));
> print(A);
  A S D
X 1 4 8
Y 2 6 9
Z 3 7 0
You can check whether a variable is a matrix using the class() function
and check its dimensions using the attributes() function:


> class(A);
[1] "matrix"
> attributes(A);
$dim
[1] 3 3

$dimnames
$dimnames[[1]]
[1] "X" "Y" "Z"

$dimnames[[2]]
[1] "A" "S" "D"


In R 4.0 and later, class(A) returns "matrix" "array" instead.


You can get column names and row names using the colnames() and 
rownames() functions: 


> colnames(A);
[1] "A" "S" "D"
> rownames(A);
[1] "X" "Y" "Z"


You can also create a matrix by using column binding and row binding 
functions: 


> B <- cbind(c(1, 2, 3), c(4, 5, 6));
> print(B);
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> C <- rbind(c(1, 2, 3), c(4, 5, 6));
> print(C);
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6


To select the first row:


> print(A);
  A S D
X 1 4 8
Y 2 6 9
Z 3 7 0
> A[1, ];
A S D
1 4 8


To select the first column:


> print(A);
  A S D
X 1 4 8
Y 2 6 9
Z 3 7 0
> A[, 1];
X Y Z
1 2 3


To select all rows except the last row:


> print(A);
  A S D
X 1 4 8
Y 2 6 9
Z 3 7 0
> A[-3, ];
  A S D
X 1 4 8
Y 2 6 9


To select the second row and second column: 


> print(A);
  A S D
X 1 4 8
Y 2 6 9
Z 3 7 0
> A[2, 2];
[1] 6


Using a logical vector to select the first row:


> print(A);
  A S D
X 1 4 8
Y 2 6 9
Z 3 7 0
> A[c(TRUE, FALSE, FALSE), ];
A S D
1 4 8


To select elements based on a greater than or less than comparison:


> print(A);
  A S D
X 1 4 8
Y 2 6 9
Z 3 7 0
> A[A > 4];
[1] 6 7 8 9
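
Selecting a single row or column returns a plain vector rather than a matrix, because R drops the unused dimension by default. The sketch below, with illustrative variable names, shows the drop argument that keeps the result as a matrix, and confirms that a comparison returns the matching values in column order.

```r
A <- matrix(c(1, 2, 3, 4, 6, 7, 8, 9, 0), nrow = 3, ncol = 3,
            dimnames = list(c("X", "Y", "Z"), c("A", "S", "D")))

row_vec <- A[1, ]                  # dimension dropped: a named vector
row_mat <- A[1, , drop = FALSE]    # still a 1 x 3 matrix
big     <- A[A > 4]                # values greater than 4, read column by column

print(row_vec)
print(row_mat)
print(big)
```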
To modify the second row and second column:


> print(A);
  A S D
X 1 4 8
Y 2 6 9
Z 3 7 0
> A[2, 2] <- 100;
> print(A);
  A   S D
X 1   4 8
Y 2 100 9
Z 3   7 0


To add a row, use the rbind() function:


> print(A);
  A   S D
X 1   4 8
Y 2 100 9
Z 3   7 0
> B <- rbind(A, c(1, 2, 3));
> print(B);
  A   S D
X 1   4 8
Y 2 100 9
Z 3   7 0
  1   2 3


To add a column, use the cbind() function:


> print(A);
  A   S D
X 1   4 8
Y 2 100 9
Z 3   7 0
> C <- cbind(A, c(1, 2, 3));
> print(C);
  A   S D
X 1   4 8 1
Y 2 100 9 2
Z 3   7 0 3


To transpose a matrix, use the t() function:


> print(A);
  A   S D
X 1   4 8
Y 2 100 9
Z 3   7 0
> A <- t(A);
> print(A);
  X   Y Z
A 1   2 3
S 4 100 7
D 8   9 0
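
Since matrices are recommended for linear algebra, here is a minimal sketch of the difference between element-wise multiplication (*) and the matrix product (%*%). The small matrix M is made up for illustration.

```r
M <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)  # columns are (1, 2) and (3, 4)

elementwise <- M * M      # multiplies matching cells
product     <- M %*% M    # row-by-column matrix product
transposed  <- t(M)       # rows become columns

print(elementwise)
print(product)
print(transposed)
```

The two operators give different results on the same input, so it is worth checking which one a formula calls for.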


Data Frame 


A data frame is a special kind of list, an R object arranged as a table
of rows and columns, and is usually used to store data read from an Excel
or .csv file. A matrix can only store values of the same type, but a data
frame can store a different type in each column. To declare a data frame,
use the following syntax:


variable = data.frame(colName1 = c(..., ..., ...),
                      colName2 = c(..., ..., ...), ...)

> A <- data.frame(emp_id=c(1, 2, 3), names=c("John", "James",
"Mary"), salary=c(111.1, 222.2, 333.3));
> print(A);
  emp_id names salary
1      1  John  111.1
2      2 James  222.2
3      3  Mary  333.3


You can use the typeof() and class() functions to check whether a 
variable is of the data frame type: 


> typeof(A);
[1] "list"
> class(A);
[1] "data.frame"


To get the number of columns and rows, you can use the ncol() and 
nrow() functions: 


> ncol(A);
[1] 3
> nrow(A);
[1] 3 


To get the structure of the data frame, you can use the str() function: 


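> str(A);
'data.frame': 3 obs. of 3 variables:
 $ emp_id: num 1 2 3
 $ names : Factor w/ 3 levels "James","John",..: 2 1 3
 $ salary: num 111 222 333


The names column is shown as a factor because, in R versions before 4.0,
data.frame() converts character vectors to factors by default. From R 4.0
onward it stays a character (chr) column unless you pass
stringsAsFactors=TRUE.
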
You can also use the read.csv() function to read a .csv file as a data
frame:


> myData <- read.csv(file="D:/data.csv", header=TRUE, sep=",");
> ncol(myData);
[1] 4
> nrow(myData);
[1] 100
> str(myData);
'data.frame': 100 obs. of 4 variables:
 $ x : num 2.216 -0.181 1.697 1.655 1.068 ...
 $ x2: num 4.77 4.1 2.33 2.46 1.05 ...
 $ x3: num -4.87 6.98 3.92 0.75 3.35 ...
 $ y : int 0 1 0 0 1 1 1 1 1 1 ...


To select a column, use [ ], [[ ]], or $:


> print(A);
  emp_id names salary
1      1  John  111.1
2      2 James  222.2
3      3  Mary  333.3
> A["names"];
  names
1  John
2 James
3  Mary
> A$names;
[1] John  James Mary
Levels: James John Mary
> A[[2]];
[1] John  James Mary
Levels: James John Mary


To modify the first row and second column:


> print(A);
  emp_id names salary
1      1  John  111.1
2      2 James  222.2
3      3  Mary  333.3
> A[1, 2] <- "James";
> print(A);
  emp_id names salary
1      1 James  111.1
2      2 James  222.2
3      3  Mary  333.3


To add a row, use the rbind() function:


> print(A);
  emp_id names salary
1      1 James  111.1
2      2 James  222.2
3      3  Mary  333.3
> B <- rbind(A, list(4, "John", 444.4));
> print(B);
  emp_id names salary
1      1 James  111.1
2      2 James  222.2
3      3  Mary  333.3
4      4  John  444.4


To add a column, use the cbind() function:


> print(A);
  emp_id names salary
1      1 James  111.1
2      2 James  222.2
3      3  Mary  333.3
> B <- cbind(A, state=c("NY", "NY", "NY"));
> print(B);
  emp_id names salary state
1      1 James  111.1    NY
2      2 James  222.2    NY
3      3  Mary  333.3    NY


To delete a column:


> print(A);
  emp_id names salary
1      1 James  111.1
2      2 James  222.2
3      3  Mary  333.3
> A$salary <- NULL;
> print(A);
  emp_id names
1      1 James
2      2 James
3      3  Mary
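
The examples above only assign names that already exist in the names column. When that column is a factor, assigning a name that is not one of its levels produces NA with a warning. The sketch below is one way to avoid that, assuming you are free to keep the column as plain character data with stringsAsFactors=FALSE; the name "Jane" is a made-up value for illustration.

```r
A <- data.frame(emp_id = c(1, 2, 3),
                names  = c("John", "James", "Mary"),
                salary = c(111.1, 222.2, 333.3),
                stringsAsFactors = FALSE)   # keep names as character, not factor

A[1, 2] <- "Jane"                           # a new name is accepted without a warning

# Add a row by binding a one-row data frame with the same column names.
B <- rbind(A, data.frame(emp_id = 4, names = "John", salary = 444.4,
                         stringsAsFactors = FALSE))

B$state  <- "NY"     # add a column; the single value is recycled for every row
B$salary <- NULL     # delete a column

print(B)
```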


Logical Statements


if...else statements are the decision-making fragments of your code in R.
They let your program choose which code to run by testing a Boolean
expression:


if (Boolean expression) {
  #Code to execute if Boolean expression is true
} else {
  #Code to execute if Boolean expression is false
}


Table 3-2 shows the Boolean operators that can be used in writing the
Boolean expressions of the if...else statements.


Table 3-2. Boolean Operators


Boolean Operator    Definition

==                  Equal to
>=                  Greater than or equal to
<=                  Less than or equal to
>                   Greater than
<                   Less than
!=                  Not equal to


You can add more rules to the program by chaining additional conditions
with else if:


if (Boolean expression 1) {
  #Code to execute if Boolean expression 1 is true
} else if (Boolean expression 2) {
  #Code to execute if Boolean expression 2 is true and Boolean
  #expression 1 is false
} else if (Boolean expression 3) {
  #Code to execute if Boolean expression 3 is true and Boolean
  #expressions 1 and 2 are false
} else {
  #Code to execute if all Boolean expressions are false
}


The following is an example of using else if: 


> A <- c(1, 2);
> B <- sum(A); #1 + 2
>
> if(B >= 5) {
+   print("B is greater or equal to 5");
+ } else if(B >= 3) {
+   print("B is more than or equal to 3");
+ } else {
+   print("B is less than 3");
+ }
[1] "B is more than or equal to 3"


If B is more than or equal to 5, the R console will print "B is greater or
equal to 5". Else if B is more than or equal to 3 but less than 5, the R
console will print "B is more than or equal to 3". Otherwise, the R console
will print "B is less than 3". The R console printed "B is more than or
equal to 3" because B is 1 + 2 = 3.
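
To check all three branches rather than only the one reached by B = 3, the same rules can be wrapped in a function and called with different values. This is a minimal sketch; describe_b is a hypothetical helper name and is not part of the original example.

```r
# Returns the message for a given value instead of printing it,
# so that each branch can be tested separately.
describe_b <- function(B) {
  if (B >= 5) {
    "B is greater or equal to 5"
  } else if (B >= 3) {
    "B is more than or equal to 3"
  } else {
    "B is less than 3"
  }
}

print(describe_b(sum(c(1, 2))))   # the case from the example above
print(describe_b(7))
print(describe_b(0))
```

The order of the conditions matters: the first condition that is true wins, so the check for 5 must come before the check for 3.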


Loops 


Loops are used to repeat certain fragments of code. For example, if you
want to print the "This is R." message 100 times, it would be very tiresome
to type print("This is R."); 100 times. You can use a loop to print the
message 100 times more easily. Loops are also commonly used to go through
the elements of a vector, list, or data frame. R has three kinds of loop:
the for loop, the while loop, and the repeat loop.


For Loop 


Let's start with the syntax for a for loop: 


for (value in vector) {
  #statements
}


Example:
> A <- c(1:5); #create a vector with values 1, 2, 3, 4, 5 
> 
> for(e in A) { #for each element and value in vector A 
+  print(e); #print the element and value 
+ }
[1] 1
[1] 2 
[1] 3 
[1] 4 
[1] 5 


In this example, you create vector A, with values 1 to 5. For each 
element in the vector, you print the element in the console. See Figure 3-10 
for the logic. 


[Flowchart: get each element of the vector, process the statements, and
repeat until the last element is reached.]


Figure 3-10. For loop of R
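
A for loop can also run over positions instead of values, which is useful when you want to store a result for each element. The following is a sketch with made-up data; employee_names and name_lengths are illustrative names.

```r
employee_names <- c("John", "James", "Mary")

# Pre-allocate the result vector, then fill it one position at a time.
name_lengths <- integer(length(employee_names))
for (i in seq_along(employee_names)) {
  name_lengths[i] <- nchar(employee_names[i])   # number of characters in each name
}

print(name_lengths)
```

seq_along() is used instead of 1:length(...) because it produces an empty sequence for an empty vector, so the loop body is skipped rather than run with invalid indices.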
While Loop


You can also use a while loop to keep looping for as long as a Boolean
expression is true:


while (Boolean Expression) {
  #Code to run or repeat until Boolean Expression is false
}


Example: 


> i <- 1;
>
> while(i <= 10) {
+   print(i);
+   i <- i + 1;
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10


In this example, you use the while loop to print the numbers from 1 to 10.
The variable i is assigned a value of 1. While i is less than or equal to
10, the R console prints the value of i. After printing, 1 is added to i.
The loop repeats until i is more than 10. See Figure 3-11 for the logic
flow.
[Flowchart: check whether the condition is true; if yes, run the code
block and check the condition again.]


Figure 3-11. While loop of R
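
A while loop fits best when the number of repetitions is not known in advance. As a small sketch, the loop below keeps adding consecutive integers until the running total reaches a limit; the limit of 20 and the variable names are made up for illustration.

```r
total <- 0
i <- 0

# Keep adding 1, 2, 3, ... until the running total is at least 20.
while (total < 20) {
  i <- i + 1
  total <- total + i
}

print(i)
print(total)
```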


Break and Next Keywords 


In loop statements, you can use the break keyword and the next keyword.
The break keyword stops the loop entirely. The next keyword skips the rest
of the current iteration and moves on to the next one.

Example for the break keyword:


> A <- c(1:10);
>
> for(e in A) {
+
+   if(e == 5) {
+     break;
+   }
+
+   print(e);
+ }
[1] 1
[1] 2
[1] 3
[1] 4
In this example, you create vector A with values from 1 to 10. For each
element in A, you print the element value. If the element value is 5, you
exit from the loop.

Example for the next keyword:


> A <- c(1:10);
> for(e in A) {
+
+   if(e == 5) {
+     next;
+   }
+   print(e);
+
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
In this example, you create vector A with values from 1 to 10. For each
element in A, you print the element value. If the element value is 5, you
skip the current iteration and go to the next iteration, so the value 5 is
not printed. The break statement is useful when you loop through a set of
values to search for something: once you find the value you want, you can
break out of the loop.
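
The search pattern described above can be sketched as follows, using next to skip unusable entries and break to stop at the first match. The data and the threshold of 10 are made up for illustration.

```r
values <- c(3, 8, NA, 5, 12, 7)   # made-up data with one missing value
found <- NA
checked <- 0

for (v in values) {
  if (is.na(v)) {
    next                  # skip missing values without counting them
  }
  checked <- checked + 1
  if (v > 10) {
    found <- v            # remember the first value above 10
    break                 # stop searching; later values are never visited
  }
}

print(found)
print(checked)
```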


Repeat Loop 


The repeat loop repeats the code indefinitely; there is no Boolean
expression to meet. To stop the loop, you must use the break keyword.


repeat {
  #code to repeat
}

Example:
> i <- 1;
> repeat {
+   if(i > 10)
+     break;
+
+   print(i);
+   i <- i + 1;
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10


In this example, you create variable i with a value of 1. You repeat
printing i and adding 1 to i until i is more than 10, and then you break
out from the loop. If you forget to add a condition statement and the
break keyword, you can end up in an infinite loop. An infinite loop is
dangerous because it can consume your system resources and cause your
program to keep on looping at the same place. The for loop is the
preferable loop because the values to loop over are defined in the first
statement:


for(e in A) ....


The while loop is the next preferred loop. A condition statement is also
stated in the first statement:


while(i <= ...) ...


If you forget to increment i with i <- i + 1;, you will also have an
infinite loop.
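
One defensive habit, sketched here as a suggestion rather than as part of the original example, is to give a repeat loop a second exit based on an iteration counter. The limit max_iter is a hypothetical safety value.

```r
i <- 1
iterations <- 0
max_iter <- 1000    # hypothetical safety limit on the number of passes

repeat {
  iterations <- iterations + 1
  # Leave when the real condition is met, or when the safety limit is hit.
  if (i > 10 || iterations >= max_iter) {
    break
  }
  i <- i + 1
}

print(i)
print(iterations)
```

With the counter in place, forgetting the line that increments i would end the loop after max_iter passes instead of running forever.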


Functions 


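Functions help you organize your code and allow you to reuse code
fragments whenever you need them. To create a function, use the following
syntax:


function_name <- function(arg1, arg2, ...) {
  #Code fragments
  function_name = #value to return
}


The last line works because an R function returns the value of the last
expression it evaluates; the assignment only creates a local variable that
happens to share the function's name.

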
Example:


> A <- c(1:5);
>
> productVect <- function(a) {
+
+   res <- 1;
+
+   for(e in a) {
+     res <- res * e;
+   }
+
+   productVect = res;
+
+ }
>
> print(productVect(A));
[1] 120


In this example, you create the productVect() function. This function
does the same job as the built-in prod() function in R.

The productVect() function takes one argument. For every element in the
argument (which should be a vector), res is set to res times the element.
After the loop is completed, the productVect() function returns the res
value.

You can call the function using productVect(A). In the above code, you
call the function using


A <- c(1:5);
print(productVect(A));

A is a vector with values from 1 to 5. The argument a in the function
declaration is the formal argument. The argument A that you pass to the
function when calling productVect() is called the actual argument. When
you call the function using productVect(A), the formal argument a is
assigned the value of the actual argument A.
You can call the productVect() function a few times in your code:


> productVect <- function(a) {
+   res <- 1;
+   for(e in a) {
+     res <- res * e;
+   }
+
+   productVect = res;
+ }
>
> A <- c(1:5);
> print(productVect(A));
[1] 120
>
> B <- c(1:10);
> print(productVect(B));
[1] 3628800


You can also create default values for the argument in the function:


productVect <- function(a=c(1:5)) {
  res <- 1;

  for(e in a) {
    res <- res * e;
  }

  productVect = res;
}

print(productVect());

A <- c(1:5);
print(productVect(A));

B <- c(1:10);
print(productVect(B));


When you call the function without an argument, print(productVect()),
you still get a result because the default value c(1:5) is used:


> productVect <- function(a=c(1:5)) {
+   res <- 1;
+
+   for(e in a) {
+     res <- res * e;
+   }
+
+   productVect = res;
+ }
>
> print(productVect());
[1] 120
>
> A <- c(1:5);
> print(productVect(A));
[1] 120
>
> B <- c(1:10);
> print(productVect(B));
[1] 3628800


Instead of using productVect = res, you can also use the return()
function to return the result:


> productVect <- function(a=c(1:5)) {
+   res <- 1;
+
+   for(e in a) {
+     res <- res * e;
+   }
+
+   return(res);
+ }
>
> print(productVect());
[1] 120
>
> A <- c(1:5);
> print(productVect(A));
[1] 120
>
> B <- c(1:10);
> print(productVect(B));
[1] 3628800
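
Because productVect() is meant to match the built-in prod() function, one simple way to gain confidence in it is to compare the two on a few inputs. The sketch below repeats the return() version of the function so that it can run on its own; the comparison inputs are made up.

```r
productVect <- function(a = c(1:5)) {
  res <- 1
  for (e in a) {
    res <- res * e
  }
  return(res)
}

# Compare with the built-in prod() on several inputs.
same_default <- productVect() == prod(1:5)
same_ten     <- productVect(1:10) == prod(1:10)
same_empty   <- productVect(numeric(0)) == prod(numeric(0))  # loop body never runs

# Arguments can also be passed by name.
by_name <- productVect(a = c(2, 3))

print(c(same_default, same_ten, same_empty))
print(by_name)
```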
Create Your Own Calculator


In this chapter, you learned a lot of R programming syntax. You looked
into variables, data types, vectors, lists, matrices, data frames, loops,
and functions. In this section, you are going to put many of the things
you just learned into a Calculator script:


add <- function(a, b) {
  res <- a + b;
  return(res);
}

subtract <- function(a, b) {
  res <- a - b;
  return(res);
}

product <- function(a, b) {
  res <- a * b;
  return(res);
}

division <- function(a, b) {
  res <- a / b;
  return(res);
}

print("Select your option: ");
print("1. Add");
print("2. Subtract");
print("3. Product");
print("4. Division");

opt <- as.integer(readline(prompt = "> "));
firstNum <- as.integer(readline(prompt = "Enter first number: "));
secondNum <- as.integer(readline(prompt = "Enter second number: "));

res <- 0;
if(opt == 1) {
  res <- add(firstNum, secondNum);
} else if(opt == 2) {
  res <- subtract(firstNum, secondNum);
} else if(opt == 3) {
  res <- product(firstNum, secondNum);
} else if(opt == 4) {
  res <- division(firstNum, secondNum);
} else {
  print("Error. ");
}
print(res);


In this code, you create the add(), subtract(), product(), and
division() functions. You then print the "Select your option:" message
and the list of options. You assign the user input to the opt variable.
You get the user input using the readline() function, and you use
as.integer() to convert the user input to the integer type. You use
if...else statements to call the matching function based on the user
input. You then print the result.
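
The script above depends on typed input, which makes it awkward to test. As an alternative sketch, not the book's version, the option handling can be moved into a function that takes the option and the two numbers as arguments. calculate is a hypothetical name, and switch() is used here in place of the if...else chain.

```r
# Maps an option number to an arithmetic operation and returns the result.
# An unknown option stops with an error instead of printing 0.
calculate <- function(opt, a, b) {
  switch(as.character(opt),
         "1" = a + b,
         "2" = a - b,
         "3" = a * b,
         "4" = a / b,
         stop("Unknown option: ", opt))
}

print(calculate(1, 2, 3))
print(calculate(4, 6, 3))
```

In an interactive session, the values returned by readline() and as.integer() could be passed straight to calculate().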
To run the code, go to Code > Run Region > Run All, as shown in
Figure 3-12.


Figure 3-12. Running the Calculator R script in the RStudio IDE


The output is shown in Figure 3-13.


> print("1. Add");
[1] "1. Add"
> print("2. Subtract");
[1] "2. Subtract"
> print("3. Product");
[1] "3. Product"
> print("4. Division");
[1] "4. Division"
> opt <- as.integer(readline(prompt = "> "));
> 1
> firstNum <- as.integer(readline(prompt = "Enter first number: "));
Enter first number: 2
> secondNum <- as.integer(readline(prompt = "Enter second number: "));
Enter second number: 3
> res <- 0;
> if(opt == 1) {
+   res <- add(firstNum, secondNum);
+ } else if(opt == 2) {
+   res <- subtract(firstNum, secondNum);
+ } else if(opt == 3) {
+   re .... [TRUNCATED]
> print(res);
[1] 5


Figure 3-13. Running the results of the Calculator R script in the
RStudio IDE


Conclusion 


In this chapter, you looked into R programming. You explored the R 
console and the RStudio code editor. The R console is for shorter code and 
the RStudio code editor is for longer R code or scripts. 

You learned about variables. A variable is a container to store some 
values. A variable can have a name, which is called a variable name. Data 
types are the types or kind of information or data a variable is holding. 
You also looked into vectors. A vector is a basic data structure or R
object that stores a set of values of the same data type. The data types
can be logical, integer, double, character, and more. Vectors can be
created using the c() function.

You also learned about lists. Lists are like vectors: they are R objects
that can store a set of values or elements, but a list can store values of
different data types.

You also learned about matrices. A matrix is a two-dimensional data
structure that, like a vector, stores values of a single data type.

You also learned about data frames. A data frame is a special kind of
list, arranged as a table of rows and columns, and is usually used to
store data read from Excel or .csv files.

You also learned about conditional statements. if...else statements are
the decision-making fragments of your code in R. They let your program
choose which code to run by testing Boolean expressions.

You also learned about loops. Loops are used to repeat certain
fragments of code. For example, if you want to print the "This is R."
message 100 times, it would be very tiresome to type print("This is R.");
100 times. You can use loops to print the message 100 times more easily.
R has for loops, while loops, and repeat loops.

You also learned about functions. Functions help you organize your 
code and allow you to reuse code fragments whenever you need. 

Finally, you created your own calculator based on what you learned. 
CHAPTER 4


Descriptive Statistics


Descriptive statistics is a set of mathematical methods used to summarize
data. It describes the distribution, the central tendency, and the
dispersion of the data. The distribution can be, for example, a normal
distribution or a binomial distribution. The central tendency can be
measured by the mean, median, and mode. The dispersion, or spread, can be
measured by the range, interquartile range, variance, and standard
deviation.

In this chapter, you will import a CSV file, Excel file, and SPSS file, 
and you will perform basic data processing. I will explain descriptive 
statistics, central tendency measurements, dispersion measurements, 
and distributions. You will look into how R programming can be used to 
calculate all these values, and how to test and see whether data is normally 
distributed. 


What Is Descriptive Statistics? 


Descriptive statistics summarizes the data and usually focuses on the
distribution, the central tendency, and the dispersion of the data. The
distribution can be a normal distribution, a binomial distribution, or
another distribution such as the Bernoulli distribution. The binomial and
normal distributions are the more popular and important ones, especially
the normal distribution. When exploring data and before running many
statistical tests, you will usually check the normality of the data, which
is how likely it is that the data is normally distributed. The Central
Limit Theorem states that the distribution of the mean of a sample
approaches a normal distribution as the sample size increases, regardless
of whether the sample comes from a normal distribution. The central
tendency, not to be confused with the Central Limit Theorem, describes the
data with respect to its center. Central tendency can be the mean, median,
and mode of the data. The dispersion describes the spread of the data, and
it can be the variance, standard deviation, and interquartile range.
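
The measures named above can be computed with base R functions. The following is a minimal sketch on made-up values; x and the other variable names are illustrative. R has no built-in function for the statistical mode (the mode() function reports the storage type instead), so the sketch assumes that taking the most frequent value from a frequency table is acceptable.

```r
x <- c(2, 4, 4, 5, 7, 9)   # made-up sample values

# Central tendency
x_mean   <- mean(x)
x_median <- median(x)
most_common <- as.numeric(names(which.max(table(x))))  # first most frequent value

# Dispersion
x_range <- range(x)   # smallest and largest value
x_iqr   <- IQR(x)     # interquartile range, default quantile method
x_var   <- var(x)     # sample variance (divides by n - 1)
x_sd    <- sd(x)      # square root of the sample variance

print(c(mean = x_mean, median = x_median, mode = most_common))
print(x_range)
print(c(iqr = x_iqr, variance = x_var, sd = x_sd))
```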

Descriptive statistics summarizes the data set, gives us a feel for and
an understanding of the data and variables, and helps us decide whether we
should use inferential statistics to identify the relationship between
data sets or use regression analysis to identify the relationships between
variables.


Reading Data Files 


R programming allows you to import a data set, which can be a
comma-separated values (CSV) file, Excel file, tab-separated file, JSON
file, or others. Reading data into R is important, since you must have
some data before you can do statistical computing and understand the data.
Before you look into importing data into the R console, you must
determine your workspace, or working directory, first. You should always
set the working directory to tell R the location of your current project
folder. This allows for easier references to data files and scripts.
To print the current working directory, you use the getwd() function:


#get the current workspace location
> print(getwd());
[1] "C:/Users/gohmi/Documents"


You can set the working directory using the setwd() function:


#set the current workspace location
#input your own file directory; here we use "D:/R"
> setwd("D:/R");


To get the new working directory location, you can use the getwd()
function:


#get the new workspace
> print(getwd());
[1] "D:/R"


You can put the data.csv data set into the D:/R folder.


Reading a CSV File 


You can read the data.csv file into the R console using the read.csv()
function:


> data <- read.csv(file="data.csv", header=TRUE, sep=",");


You can view the data by clicking the data variable in the Global
Environment portion of the RStudio IDE. The data will be displayed in
table form. You can read the file using just data.csv because you have set
the working directory to D:/R, so file="data.csv" refers to
D:/R/data.csv. See Figure 4-1.


Figure 4-1. View data in table form in RStudio


The data type of the data variable is data frame. You can check this
using the class() function:


> class(data);
[1] "data.frame"


The function you use to read a .csv file is


> data <- read.csv(file="data.csv", header=TRUE, sep=",");


file is the name or path of the data file that you are going to read.
header is a logical value that states whether the names of the variables
are in the first line. sep is the separator character, and quote is the
set of quoting characters, with "\"" for " and "\'" for '. You can add
the quote argument for double quotation marks as follows:


> data <- read.csv(file="data.csv", header=TRUE, sep=",", quote="\"");
> View(data);


Writing a CSV File


To write a CSV file, you can use the write.csv() function:


> write.csv(data, file="data2.csv", quote=TRUE, na="na",
row.names=FALSE);


The exported file is shown in Figure 4-2.


Figure 4-2. Exported csv file opened in Microsoft Excel


In the write.csv() function you used previously to export the CSV file,


> write.csv(data, file="data2.csv", quote=TRUE, na="na",
row.names=FALSE);


data is the variable of the data frame type you would like to export,
file is the file path or location to export to, quote is a logical value
that states whether to put quotation marks around character values, na is
the string to use for missing values, and row.names is a logical value
that indicates whether the row names should be written.
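
One consequence of writing missing values as "na" is that read.csv() will not recognize them by default, because it looks for "NA". The round trip below is a sketch under that assumption: it writes a small made-up data frame to a temporary file and reads it back with na.strings set to match. The data frame df and the temporary path are illustrative.

```r
df <- data.frame(x = c(1.5, NA, 3),
                 y = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")    # temporary file instead of a fixed folder

write.csv(df, file = path, quote = TRUE, na = "na", row.names = FALSE)

# Tell read.csv() that the string "na" marks a missing value.
back <- read.csv(file = path, header = TRUE, sep = ",",
                 na.strings = "na", stringsAsFactors = FALSE)

print(back)
unlink(path)                          # remove the temporary file
```

Using the default na="NA" when writing avoids the need for na.strings when reading the file back.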
Reading an Excel File


The data set can also be in the Excel, or .xlsx, format. To read an
Excel file, you need to use the xlsx package. The xlsx package requires
a Java runtime, so you must install Java on your computer. To install the
xlsx package, go to the R console and type the following, also shown in
Figure 4-3:


> install.packages("xlsx");


Figure 4-3. xlsx package installation in RStudio


To use the xlsx package, load it with the require() function:


> require("xlsx");
Loading required package: xlsx


To read the Excel file, you can use the read.xlsx() function:


> data <- read.xlsx(file="data.xlsx", 1);


file is the location of the Excel file. 1 refers to sheet number 1.

To view the data variable, you can use the View() function or click
the data variable in the Environment portion of RStudio, as shown in
Figure 4-4.


Figure 4-4. View(data) in RStudio


To look for the documentation of read.xlsx(), you can use the
following code, as shown in Figure 4-5:


> help(read.xlsx);


Figure 4-5. Documentation of read.xlsx()


The data variable is of the data frame data type: 


> class(data);
[1] "data.frame" 


Writing an Excel File 


To write an Excel file, you can use the write.xlsx() function:


> write.xlsx(data, file="data2.xlsx", sheetName="sheet1",
col.names=TRUE, row.names=FALSE);


data is the variable of data frame type to export to the Excel file, file
is the file location or path, sheetName is the sheet name, and col.names
and row.names are logical values that state whether to export the column
names and row names.

To view the documentation of the write.xlsx() function or any R
function, you can use the help() function.


Figure 4-6. Exported xlsx file


Reading an SPSS File 


To read an SPSS file, you need to use the foreign package. You can install
the foreign package using the install.packages() function:


> install.packages("foreign");
Installing package into 'C:/Users/gohmi/Documents/R/win-library/3.5'
(as 'lib' is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/foreign_0.8-71.zip'
Content type 'application/zip' length 324526 bytes (316 KB)
downloaded 316 KB

package 'foreign' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\gohmi\AppData\Local\Temp\RtmpET26Hv\downloaded_packages


To use the foreign package, load it with the require() function:


> require(foreign);
Loading required package: foreign


To read the SPSS file into a data frame, you use the read.spss()
function:


> data <- read.spss(file="data.spss", to.data.frame=TRUE);


file is the path or location of the SPSS file to read. to.data.frame is a
logical value that states whether to return the SPSS data as a data frame.

You can use the help() function to get the documentation of the
read.spss() function, as shown in Figure 4-7.


Figure 4-7. Documentation of read.spss()


Writing an SPSS File


You can write the SPSS file using the write.foreign() function:


> write.foreign(data, "mydata.txt", "mydata.sps",
package="SPSS");


data is the variable to export to the SPSS data file, mydata.txt is the
file that receives the data in comma-delimited format, mydata.sps is the
basic syntax file used to read the data file into SPSS, and package sets
the export format, here SPSS.


Reading a JSON File 


JSON, or JavaScript Object Notation, is a very popular data interchange 

format that is easy for humans to write or read. A JSON file can be read by 
using the rjson package. To install the rjson package, you use the install. 
packages () function: 


> install.packages("rjson");
Installing package into 'C:/Users/gohmi/Documents/R/win-library/3.5'
(as 'lib' is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/rjson_0.2.20.zip'
Content type 'application/zip' length 577826 bytes (564 KB)
downloaded 564 KB

package 'rjson' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
        C:\Users\gohmi\AppData\Local\Temp\RtmpET26Hv\downloaded_packages


You can use the require() function to load the rjson package:


> require(rjson); 
Loading required package: rjson 


You can read a JSON file using the fromJSON() function:


> data <- fromJSON(file="data.json");


To convert the data to a data frame type, you can use this code:


> data2 <- as.data.frame(data);


Basic Data Processing 


After importing the data, you may need to do some simple data processing
like selecting data, sorting data, filtering data, getting unique values, and
removing missing values.


Selecting Data 


You can select a few columns from the data using a vector: 


> data; 

            x          x2          x3 y
1  2.21624472 4.774511945 -4.87198610 0
2 -0.18104835 4.100479091  6.97727175 1
3  1.69712196 2.328894837  3.92445970 0
4  1.65499099 2.462167830  0.74972168 0
5  1.06797834 1.053091767  3.35380788 1
6  0.67543296 1.918655276  1.56826805 1
7  0.19982505 3.063870668  4.48912276 1
8  0.91662531 1.953422065  3.29408509 1
9  1.30843083 1.322800550  0.39717775 1


> data[, c("x", "x3")];

x x3
1 2.21624472 -4.87198610 
2  -0.18104835 6.97727175 
3 1.69712196 3.92445970 
4 1.65499099 0.74972168 
5 1.06797834  3.35380788 
6 0.67543296 1.56826805 
7 0.19982505 4.48912276 
8 0.91662531 3.29408509 
9 1.30843083 0.39717775 
10 -0.12830745 3.63913066 
11 1.39507566 0.26466993 
12 2.21668825 3.69688978 
13 2.64020481 3.74926815 
14 -0.60394410 5.49485937 
15 0.49529219 2.72051420 
16  1.91349092 2.21675086 
17  1.33149648 7.09660419 
18 1.42607352 5.94312583 
19  2.93044162 2.27876092 
20  1.76600446 6.91145502 


You can select a variable using the $ sign, as stated in a previous
chapter:


> data$x3;


[1] -4.87198610 6.97727175 3.92445970 0.74972168 
3.35380788 1.56826805 4.48912276 

[8] 3.29408509 0.39717775 3.63913066 0.26466993 
3.69688978 3.74926815 5.49485937 

[15] 2.72051420 2.21675086 7.09660419 5.94312583
2.27876092 6.91145502 7.10060931

[22] 4.62416860 3.12633172 5.63667497 0.37028080
-0.11370995 2.27488863 0.43562110

[29] 0.46417756 3.44465146 4.14409404 3.78561287 
1.86181693 8.10920939 0.87207093 

[36] 0.55297962 4.26909037 1.01777720 12.85624593 
4.79384178 -1.10646203 4.48442125 

[43] -3.56106951 1.71246170 9.74478236 3.15799853 
0.97278927 2.35670484 3.08804548 

[50] 1.52772318 -5.02155267 5.64303286 -1.24622282 
0.59864199 1.11359605 4.38302156 

[57] 2.54163230 1.19193935 -0.57096625 7.49237946 
6.88838713 5.37947543 0.72886289 

[64] 2.20441458 -0.04416262 6.98441537 5.25116254 
-0.15175665 -0.28652257 2.97419481 

[71] 1.57454520 1.74898024 3.78645063 1.02328701 
1.51030662 -1.46386054 5.65843587 

[78] 1.71775236 2.77004224 -0.13805983 6.51654242 
-0.80982223 6.55297343 3.65082015 

[85] 5.55579403 3.03684846 6.85138858 2.09051225 
2.79632315 5.21544351 2.63005598 

[92] -0.04795488 8.79812379 0.92166450 2.97840367 
1.89262722 2.23928744 2.46465216 

[99] -0.18871437 -0.14146813 
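The two ways of selecting shown above return different kinds of objects, which matters for later steps. The sketch below uses a small made-up data frame (df is hypothetical and only mimics the column names of the book's data) to show the difference.

```r
# Hypothetical data frame that mimics the column names used in this chapter
set.seed(1)
df <- data.frame(x = rnorm(5), x2 = rnorm(5), x3 = rnorm(5),
                 y = c(0, 1, 0, 0, 1))

# Selecting several columns by name returns a data frame
two_cols <- df[, c("x", "x3")]

# The $ sign returns a plain vector
as_vector <- df$x3

# drop = FALSE keeps a single selected column as a data frame
as_frame <- df[, "x3", drop = FALSE]

print(class(two_cols))
print(class(as_vector))
print(class(as_frame))
```

Use the vector form when a function expects numbers, such as mean(), and the data frame form when you want to keep the column name and row structure.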


Sorting 


You can sort the data in x3 in ascending order using


> data[order(data$x3), ];
             x          x2          x3 y
51  0.95505576 1.796297183 -5.02155267 1
1   2.21624472 4.774511945 -4.87198610 0
...


You can sort the data in x3 in descending order using


> data[order(data$x3, decreasing=TRUE), ];


You can sort the data by multiple variables:


> data[order(data$x3, data$x2), ];
             x          x2          x3 y
51  0.95505576 1.796297183 -5.02155267 1
1   2.21624472 4.774511945 -4.87198610 0
...
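Ties are where sorting by multiple variables becomes visible. The following sketch uses a tiny made-up data frame (df and its name column are hypothetical) in which two rows share the same x3 value, so the second sort key decides their order.

```r
# Hypothetical data with a tie in x3 (rows "b" and "c")
df <- data.frame(name = c("a", "b", "c", "d"),
                 x2 = c(2, 1, 2, 1),
                 x3 = c(5, 3, 3, 9),
                 stringsAsFactors = FALSE)

# Ascending by x3
asc <- df[order(df$x3), ]

# Descending by x3
desc <- df[order(df$x3, decreasing = TRUE), ]

# Ascending by x3, ties broken by ascending x2
multi <- df[order(df$x3, df$x2), ]

# Ascending by x3, ties broken by descending x2 (negate a numeric key)
mixed <- df[order(df$x3, -df$x2), ]

print(multi)
print(mixed)
```

order() returns row positions rather than sorted values, which is why it is used inside the square brackets to rearrange whole rows.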
Filtering 


You can filter the data using Boolean expressions and statements:


> data[data$x > 0, ];
           x          x2          x3 y
1 2.21624472 4.774511945 -4.87198610 0
3 1.69712196 2.328894837  3.92445970 0
4 1.65499099 2.462167830  0.74972168 0
5 1.06797834 1.053091767  3.35380788 1
6 0.67543296 1.918655276  1.56826805 1
7 0.19982505 3.063870668  4.48912276 1
8 0.91662531 1.953422065  3.29408509 1
9 1.30843083 1.322800550  0.39717775 1
...


You can also filter the data with more complex expressions:


> data[data$x > 0 & data$x < 1, ];
           x          x2        x3 y
6 0.67543296 1.918655276 1.5682681 1
7 0.19982505 3.063870668 4.4891228 1
8 0.91662531 1.953422065 3.2940851 1
...
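One thing to watch when filtering with a Boolean expression is missing values: a condition that evaluates to NA produces a row of NAs in the result. The sketch below, on a small made-up data frame (df is hypothetical), shows the effect and two common ways to avoid it.

```r
# Hypothetical data with one missing value in x
df <- data.frame(x = c(-0.5, 0.2, 0.9, 1.7, NA),
                 y = c(0, 1, 1, 0, 1))

# Plain logical indexing keeps an all-NA row for the missing value
with_na_row <- df[df$x > 0, ]

# which() keeps only the positions where the condition is TRUE
positive <- df[which(df$x > 0), ]

# subset() also drops rows where the condition is NA
between <- subset(df, x > 0 & x < 1)

print(with_na_row)
print(positive)
print(between)
```

If the data has no missing values, all three forms give the same rows, so the choice only matters once NA values appear.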


Removing Missing Values 


You can remove rows with NA values in any variable:


> na.omit(data);
            x          x2          x3 y
1  2.21624472 4.774511945 -4.87198610 0
2 -0.18104835 4.100479091  6.97727175 1
3  1.69712196 2.328894837  3.92445970 0
4  1.65499099 2.462167830  0.74972168 0
5  1.06797834 1.053091767  3.35380788 1
6  0.67543296 1.918655276  1.56826805 1
7  0.19982505 3.063870668  4.48912276 1
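Because the book's data has no missing values, the effect of na.omit() is easier to see on a small made-up data frame. In this sketch, df is hypothetical and has a missing value in each of two different columns.

```r
# Hypothetical data with missing values in two different columns
df <- data.frame(x = c(1, NA, 3, 4),
                 y = c("a", "b", NA, "d"),
                 stringsAsFactors = FALSE)

# na.omit() drops every row that has an NA in any column
clean <- na.omit(df)

# complete.cases() reports which rows have no missing values
keep <- complete.cases(df)

# To drop rows based on one column only, test that column directly
x_only <- df[!is.na(df$x), ]

print(clean)
print(keep)
print(x_only)
```

Dropping rows only for the columns you actually analyze keeps more data than na.omit() on the whole data frame.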


Removing Duplicates 


You can remove duplicates based on the x variable using


> data[!duplicated(data$x), ];
            x          x2          x3 y
1  2.21624472 4.774511945 -4.87198610 0
2 -0.18104835 4.100479091  6.97727175 1
3  1.69712196 2.328894837  3.92445970 0
4  1.65499099 2.462167830  0.74972168 0
5  1.06797834 1.053091767  3.35380788 1
6  0.67543296 1.918655276  1.56826805 1
7  0.19982505 3.063870668  4.48912276 1
8  0.91662531 1.953422065  3.29408509 1
9  1.30843083 1.322800550  0.39717775 1
...

For more advanced duplicate removal for text data, you may use the
Levenshtein similarity algorithm and others. Covering similarity-based
duplicate removal in detail is beyond the scope of this book, but the idea
is as follows. For each row of data, you can get the Levenshtein distance
between the current row and all other rows of the data and return the rows
whose distance is below a stated value. You can then check whether they are
really similar and duplicated, and remove them as needed.
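The idea above can be sketched with base R alone. The example below is only an illustration under stated assumptions: the data frame df and its names are made up, the threshold of 2 is arbitrary, and adist() from the utils package is used as one readily available way to compute Levenshtein distances.

```r
# Hypothetical text data: one exact duplicate and one near duplicate
df <- data.frame(id = 1:5,
                 name = c("Alice", "Bob", "Alice", "Alcie", "Carol"),
                 stringsAsFactors = FALSE)

# Exact duplicates are removed with duplicated(), as shown earlier
exact <- df[!duplicated(df$name), ]

# adist() returns a matrix of Levenshtein distances between all pairs
d <- adist(df$name)

# Candidate near duplicates: different strings within distance 2
# (upper.tri() keeps each pair only once)
near <- which(d > 0 & d <= 2 & upper.tri(d), arr.ind = TRUE)

print(exact)
print(near)
```

Each row of near is a pair of row numbers to review by hand. Here "Alcie" is flagged against both copies of "Alice", and you decide whether it is really a duplicate.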
Some Basic Statistics Terms


The following are some of the most popular terms used in statistics. 


• Population: A population is the total set of observations.
A population is the whole, and it comprises every
member or observation.


• Sample: A sample is a portion of a population. A
sample can be extracted from the population using
random sampling techniques and others.


• Observations: An observation is something you
measure or count during a study or experiment. An
observation usually means a row in the data.


• Variables: A variable is a characteristic or quantity
that can be counted and can be called a data item.
Variables usually refer to the columns.


Types of Data 


Data can be numeric data or categorical data. 


• Numeric data can be discrete or continuous.
Discrete variables usually take integer values, so they
change in steps. Continuous variables can take
any real number value.


• Categorical data are categories or non-numeric data.
Categorical data can be nominal or ordinal. Nominal
variables have unordered categories, and ordinal
variables have ordered categories.
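These data types map onto R types fairly directly. The sketch below uses made-up values (all variable names are hypothetical) to show one common way to represent each type.

```r
# Discrete numeric data: integer counts
counts <- c(2L, 5L, 3L)

# Continuous numeric data: real numbers
heights <- c(1.62, 1.75, 1.80)

# Nominal data: a factor with unordered categories
colour <- factor(c("red", "blue", "red"))

# Ordinal data: an ordered factor with an explicit level order
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

# Comparisons are only meaningful for ordered factors
small_before_large <- size[1] < size[2]

print(class(counts))
print(levels(colour))
print(small_before_large)
```

Declaring the level order yourself is important, because R otherwise sorts the levels alphabetically.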
Mode, Median, Mean


Mean, median, and mode are the most common measures for central 
tendency. Central tendency is a measure that best summarizes the data 
and is a measure that is related to the center of the data set. 


Mode 


The mode is the value in the data that has the highest frequency. It is
useful when the values are non-numeric.


To get the mode in R, you start with data:


> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);


To get the mode of a vector, you create a frequency table:


> y <- table(A);
> y;
A
1 2 3 4 5 6 7 8
1 1 1 1 3 1 1 1


You want to get the highest frequency, so you use the following to get 
the mode: 


> names(y)[which(y==max(y))]; 
[1] "5" 
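If you need the mode often, the two steps above can be wrapped in a small helper. The function name stat_mode is hypothetical (base R has no built-in function for the statistical mode), and the sketch returns every value that shares the highest frequency.

```r
# Hypothetical helper: returns all values with the highest frequency
stat_mode <- function(v) {
  freq <- table(v)                         # frequency of each distinct value
  names(freq)[which(freq == max(freq))]    # names of the most frequent values
}

A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8)
mode_A <- stat_mode(A)

# With ties, more than one value is returned
letters_mode <- stat_mode(c("a", "b", "b", "c", "c"))

print(mode_A)
print(letters_mode)
```

The result is a character vector because it comes from the names of the frequency table. Use as.numeric() on it if you need numbers.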


Let’s get the mode for a data set. First, you build the frequency table:


> y <- table(data$x);
> print(y);


To get the mode in the data set, you must get the values with the highest
frequency. Since every value has a frequency of 1, every value is returned
as a mode:


> names(y)[which(y==max(y))];
 [1] "-1.10380693563466"  "-1.07591095084206"  "-1.03307520058884"  "-0.960076927640974"
 [5] "-0.681376413558189" "-0.60394409792254"  "-0.576961817793324" "-0.531319073429502"
 [9] "-0.393748947329461" "-0.363177568600528" "-0.181048347962983" "-0.146455623720039"
...


Median


The median is the middle or midpoint of the data and is also the 50th
percentile of the data. The median is less affected by outliers and
skewness than the mean. The mean is the average, which is liable to be
influenced by outliers, so the median is a better measure of centrality
when the data is skewed.

In R, to get the median, you use the median() function:


> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
> median(A);
[1] 5


To get the median for a data set:


> median(data$x2);
[1] 2.380852
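A quick experiment shows why the median is preferred for skewed data. The sketch below adds a single made-up outlier (1000) to the vector A and compares how the median and the mean react.

```r
A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8)

# Same data plus one extreme value (the outlier is made up for illustration)
B <- c(A, 1000)

median_A <- median(A)   # 5
median_B <- median(B)   # still 5
mean_A <- mean(A)       # 4.6
mean_B <- mean(B)       # pulled far upward by the outlier

print(c(median_A, median_B))
print(c(mean_A, mean_B))
```

The median stays at 5 while the mean jumps above 90, so one unusual value is enough to make the mean unrepresentative of the typical data point.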


Mean 


The mean is the average of the data. It is the sum of all data divided by
the number of data points. The mean works best if the data is distributed
in a normal distribution or distributed evenly. The mean represents the
expected value if the distribution is random.

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

In R, to get the mean, you can use the mean() function:


> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
> mean(A);
[1] 4.6


To get the mean of a data set:


> mean(data$x2);
[1] 2.46451


Interquartile Range, Variance, Standard Deviation


Measures of variability are the measures of the spread of the data.
Measures of variability can be range, interquartile range, variance,
standard deviation, and more.


Range 


The range is the difference between the largest and smallest points in the
data.
To find the range in R, you use the range() function:


> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
> range(A);
[1] 1 8


To get the difference between the max and the min, you can use


> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
> res <- range(A);
> diff(res);
[1] 7


You can use the min() and max() functions to find the range also:


> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
> min(A);
[1] 1
> max(A);
[1] 8
> max(A) - min(A);
[1] 7


To get the range for a data set:


> res <- range(data$x2);
> diff(res);
[1] 10.65222
> res <- range(data$x2, na.rm=TRUE);
> diff(res);
[1] 10.65222


na.rm is a logical value to state whether to remove NA values.
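The two calls above give the same answer because the data has no missing values. The sketch below uses a small made-up vector v that contains an NA to show what na.rm changes.

```r
# Hypothetical vector with one missing value
v <- c(3, NA, 10, 7)

# Without na.rm, a single NA makes both ends of the range NA
range_default <- range(v)

# With na.rm = TRUE, the NA is dropped before the range is computed
range_clean <- range(v, na.rm = TRUE)
spread <- diff(range_clean)

print(range_default)
print(range_clean)
print(spread)
```

The same na.rm argument is accepted by min(), max(), mean(), median(), var(), and sd().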


Interquartile Range 


The interquartile range is the measure of the difference between the 75th
percentile or third quartile and the 25th percentile or first quartile.
To get the interquartile range, you can use the IQR() function:


> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
> IQR(A);
[1] 2.5


You can get the quartiles by using the quantile() function:


> quantile(A);
  0%  25%  50%  75% 100%
1.00 3.25 5.00 5.75 8.00


You can get the 25th and 75th percentiles:


> quantile(A, 0.25);
 25%
3.25
> quantile(A, 0.75);
 75%
5.75


You can get the interquartile range and quartiles for a data set using


> quantile(data$x2);
       0%       25%       50%       75%      100%
-2.298551  1.274672  2.380852  3.750422  8.353669
> IQR(data$x2);
[1] 2.47575
> IQR(data$x2, na.rm=TRUE);
[1] 2.47575
> help(IQR);


The IQR() and quantile() functions can have NA values removed
using na.rm = TRUE.
The range measures the distance between the maximum and minimum data
values, and the interquartile range measures the spread of the middle 50%
of the values.
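To connect the two functions, the sketch below recomputes the interquartile range of A by hand from the quartiles and compares it with IQR(). It assumes the default quantile method in both calls.

```r
A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8)

# First and third quartiles in one call
q <- quantile(A, c(0.25, 0.75))

# Interquartile range by hand: third quartile minus first quartile
iqr_by_hand <- unname(q[2] - q[1])

# Built-in function for comparison
iqr_builtin <- IQR(A)

print(q)
print(iqr_by_hand)
print(iqr_builtin)
```

Both values are 2.5, which matches the difference between 5.75 and 3.25 in the quantile() output above.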


Variance 


The variance is the average of squared differences from the mean, and it is
used to measure the spread of the data.
The variance of a population is

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

where $\mu$ is the mean of the population and $N$ is the number of data
points.

The variance of a sample is

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

where $\bar{x}$ is the mean of the sample and $n$ is the number of data
points. You use n - 1 for sample variance
and sample standard deviation because of Bessel's correction to partially
correct the bias on the estimation of the population variance and standard
deviation.

To find the population variance, you use the var() function and multiply
the result by (N - 1)/N:


> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
> N <- length(A);
> N;
[1] 10
> var(A) * (N - 1) / N;
[1] 4.24


To get the sample variance, you can use the var() function:


> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
> var(A);
[1] 4.711111


To get the population variance of a data set:


> N <- nrow(data);
> N;
[1] 100
> var(data$x) * (N - 1) / N;
[1] 1.062619


To get the sample variance of a data set:


> var(data$x);
[1] 1.073352
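The (N - 1)/N adjustment can be checked against the definition of population variance. In the sketch below, pop_var is a hypothetical helper name, and the result is compared with the average of squared differences from the mean computed directly.

```r
# Hypothetical helper: population variance from the sample variance
pop_var <- function(v) {
  n <- length(v)
  var(v) * (n - 1) / n   # undo Bessel's correction
}

A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8)

from_helper <- pop_var(A)

# Direct definition: average of squared differences from the mean
from_definition <- mean((A - mean(A))^2)

print(from_helper)
print(from_definition)
print(var(A))
```

Both population values are 4.24, while var(A) gives the larger sample variance of about 4.71 because it divides by n - 1.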


Standard Deviation 


The standard deviation is the square root of the variance, and it measures
the spread of the data. The variance gets bigger when there is more
variation and smaller when there is less variation. Because the variance is
a squared result, its unit is different from the unit of the data. The
standard deviation is the square root of the variance, so it has the same
unit as the data and is easier to picture and apply.

The population standard deviation is

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

where $\mu$ is the mean of the population and $N$ is the number of data
points.

For the sample standard deviation, you use n - 1 because of Bessel’s
correction, which partially corrects the bias in the estimation of the
population variance and standard deviation:

$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\bar{x}$ is the mean of the sample and $n$ is the number of data
points.

To find the population standard deviation:


> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
> N <- length(A);
> variance <- var(A) * (N - 1) / N;
> sqrt(variance);
[1] 2.059126


To find the population standard deviation for a data set:


> N <- nrow(data);
> variance <- var(data$x2) * (N - 1) / N;
> sqrt(variance);
[1] 1.908994


To find the sample standard deviation, you use the sd() function:


> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
> sd(A);
[1] 2.170509


To find the sample standard deviation of a data set, you use the sd()
function:


> sd(data$x2);
[1] 1.918611


Normal Distribution 


The normal distribution is one of the more important concepts in statistics
because many statistical tests assume that the data is normally distributed.
It describes how data looks when plotted. The normal distribution is also
called the bell curve, shown in Figure 4-8.






Figure 4-8. Normal Distribution or Bell Curve


You can plot a distribution in R using the hist() function:


> hist(data$x, breaks=15); 

In R, breaks sets the number of bars (bins) in a histogram. When you give a
single number, R treats it as a suggestion only. A bin is the class
interval into which the data is sorted. For example, in grade data, the bins
can be 83-88, 89-94, 95-100, and each bin size should be the same.
See Figure 4-9.


Figure 4-9. Histogram of x


To see whether data is normally distributed, you can use the qqnorm()
and qqline() functions:


> qqnorm(data$x);
> qqline(data$x);


In the Q-Q plot shown in Figure 4-10, if the points stay close to the line,
the data is approximately normally distributed.


Figure 4-10. Q-Q Plot of x


You can also use the Shapiro-Wilk test to test whether the data is normally
distributed:


> shapiro.test(data$x);

        Shapiro-Wilk normality test

data:  data$x
W = 0.98698, p-value = 0.4363


If the p-value is more than 0.05, you cannot reject the hypothesis that the
data is normally distributed, so you can treat the data as not deviating
from the normal distribution.
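To see the test respond to data that is clearly not normal, you can run it on simulated samples. The sketch below is an illustration only: the two samples are generated, not taken from the book's data, and the exponential distribution is used simply as a convenient example of a strongly skewed distribution.

```r
set.seed(42)

# Simulated sample drawn from a normal distribution
normal_sample <- rnorm(100, mean = 3, sd = 0.5)

# Simulated sample drawn from a strongly right-skewed distribution
skewed_sample <- rexp(100, rate = 1)

res_normal <- shapiro.test(normal_sample)
res_skewed <- shapiro.test(skewed_sample)

# The p-value is stored in the result object
print(res_normal$p.value)
print(res_skewed$p.value)
```

The skewed sample should give a p-value far below 0.05, which leads you to reject normality. For the normal sample, a p-value above 0.05 is the usual outcome, although any single sample can fall below it by chance.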

In R, to generate random numbers from the normal distribution, you
use the rnorm() function:


> set.seed(123);
> A <- rnorm(50, 3, 0.5);
> hist(A, breaks=15);


3 is the mean and 0.5 is the standard deviation. In the above functions,
you generated 50 random values from the normal distribution. See Figure 4-11.


Figure 4-11. Histogram of A


In R, to calculate the cumulative distribution function (CDF), F(x) = 
P(X <= x) where X is normal, you use the pnorm() function: 


> pnorm(1.9, 3, 0.5); 
[1] 0.01390345 


The above is a direct lookup of the probability P(X < 1.9), where X follows
a normal distribution with a mean of 3 and a standard deviation of 0.5. If
you want P(X > 1.9), you use 1 - pnorm(1.9, 3, 0.5).

In R, if you want to calculate the inverse CDF and look up the p-th
quantile of the normal distribution, you use


> qnorm(0.95, 3, 0.5);
[1] 3.822427


This code looks for the 95th percentile of the normal distribution with a
standard deviation of 0.5 and a mean of 3. The value returned is an x value,
not a probability.
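Since qnorm() is the inverse of pnorm(), feeding the output of one into the other should return the starting value. The sketch below checks this for the example above and also computes the probability of falling within one standard deviation of the mean.

```r
# P(X <= 1.9) for a normal distribution with mean 3 and sd 0.5
p <- pnorm(1.9, mean = 3, sd = 0.5)

# Upper tail, equivalent to 1 - pnorm(1.9, 3, 0.5)
upper <- pnorm(1.9, mean = 3, sd = 0.5, lower.tail = FALSE)

# qnorm() maps the probability back to the x value
x_back <- qnorm(p, mean = 3, sd = 0.5)

# Probability of falling within one standard deviation of the mean
within_1sd <- pnorm(3.5, 3, 0.5) - pnorm(2.5, 3, 0.5)

print(p)
print(upper)
print(x_back)
print(within_1sd)
```

The last value is about 0.68, which is the two 34% bands on either side of the mean in the bell curve.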
Modality


The modality of a distribution can be seen by the number of peaks when 
we plot the histogram (see Figure 4-12): 


> hist(data$x, breaks=15);


Figure 4-12. Histogram of x


Figure 4-13 shows the modality types: unimodal, bimodal, and multimodal.
The distribution of the x variable in Figure 4-12 can be argued to be
unimodal.


Figure 4-13. Modality Type of Histogram or distribution (Unimodal, Bimodal, Multimodal)


Skewness 


Skewness is a measure of how symmetric a distribution is and how much
the distribution is different from the normal distribution. Figure 4-14
shows the types of skewness.


Figure 4-14. Types of Skewness (Positive Skew, Symmetrical, Negative Skew)


Negative skew is also known as left skewed, and positive skew is also 
known as right skewed. The histogram from the previous section has a 
negative skew. 

The kurtosis measure is used to see whether a dataset is heavy tailed or
light tailed. High kurtosis means heavy tailed, so there are more outliers in
the data. See Figure 4-15.


Figure 4-15. Kurtosis Type or heavy tailed or light tailed distribution (Normal Distribution, Heavy Tailed, Light Tailed)


To find the kurtosis and skewness in R, you must install the moments 
package: 


> install.packages("moments");
Installing package into 'C:/Users/gohmi/Documents/R/win-library/3.5'
(as 'lib' is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/moments_0.14.zip'
Content type 'application/zip' length 55827 bytes (54 KB)
downloaded 54 KB

package 'moments' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
        C:\Users\gohmi\AppData\Local\Temp\RtmpET26Hv\downloaded_packages


You also need to load the moments package:


> require(moments);
Loading required package: moments


You then use the skewness() and kurtosis() functions to get the
skewness and kurtosis:


> skewness(data$x);
[1] -0.06331548
> kurtosis(data$x);
[1] 2.401046

Binomial Distribution 


Binomial distribution has two outcomes, success or failure, and can 
be thought of as the probability of success or failure in a survey that is 
repeated various times. The number of observations is fixed, and each 
observation or probability is independent, and the probability of success is 
the same for all observations. 

To get the probability mass function, P(X = x), of a binomial distribution,
you can use the dbinom() function:


> dbinom(32, 100, 0.5);
[1] 0.000112817


This code looks up P(X = 32), where X follows a binomial distribution
with a size of 100 and a probability of success of 0.5.

To get the cumulative distribution function, P(X <= x), of a binomial
distribution, you can use the pbinom() function:


> pbinom(32, 100, 0.5);
[1] 0.0002043886


The above code looks up P(X <= 32), where X follows a binomial
distribution with a size of 100 and a probability of success of 0.5.

To get the p-th quantile of the binomial distribution, you can use the
qbinom() function:


> qbinom(0.3, 100, 0.5);
[1] 47


The above code looks up the 0.3 quantile (30th percentile) of the binomial
distribution where the size is 100 and the probability of success is 0.5. The
value returned is the smallest x for which the cumulative probability
P(X <= x) is at least 0.3.
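The three functions are tied together by simple identities, which the sketch below checks for the same size and probability as above: the cumulative probability is the sum of the point probabilities, and the quantile is the first x whose cumulative probability reaches the requested level.

```r
size <- 100
prob <- 0.5

# Point probability P(X = 32), also available from the binomial formula
pmf_32 <- dbinom(32, size, prob)
pmf_formula <- choose(size, 32) * prob^32 * (1 - prob)^(size - 32)

# Cumulative probability P(X <= 32) equals the sum of P(X = 0), ..., P(X = 32)
cdf_32 <- pbinom(32, size, prob)
cdf_by_sum <- sum(dbinom(0:32, size, prob))

# The 0.3 quantile is the smallest x with P(X <= x) >= 0.3
q30 <- qbinom(0.3, size, prob)
at_q30 <- pbinom(q30, size, prob)
below_q30 <- pbinom(q30 - 1, size, prob)

print(c(pmf_32, pmf_formula))
print(c(cdf_32, cdf_by_sum))
print(c(q30, at_q30, below_q30))
```

The last line shows that the cumulative probability first reaches 0.3 at x = 47, which is why qbinom() returns 47.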

To generate random variables from a binomial distribution, you use 
the rbinom() function: 


> set.seed(123);
> A <- rbinom(1000, 100, 0.5);
> hist(A, breaks=20);


You can use rbinom() or rnorm() to generate random variables to
simulate a new dataset.


Figure 4-16. Histogram of A


The summary() and str() Functions 


The summary() and str() functions are the fastest ways to get descriptive 
statistics of the data. The summary() function gives the basic descriptive 
statistics of the data. The str() function, as mentioned in a previous 
chapter, gives the structure of the variables. 

You can get the basic descriptive statistics using the summary() function:


> summary(data);
       x                 x2               x3                y
 Min.   :-1.1038   Min.   :-2.299   Min.   :-5.0216   Min.   :0.00
 1st Qu.: 0.3758   1st Qu.: 1.275   1st Qu.: 0.8415   1st Qu.:0.00
 Median : 1.0684   Median : 2.381   Median : 2.5858   Median :0.00
 Mean   : 1.0904   Mean   : 2.465   Mean   : 2.8314   Mean   :0.49
 3rd Qu.: 1.7946   3rd Qu.: 3.750   3rd Qu.: 4.5229   3rd Qu.:1.00
 Max.   : 3.1618   Max.   : 8.354   Max.   :12.8562   Max.   :1.00


You can get the structure of the data using the str() function:


> str(data);
'data.frame':   100 obs. of  4 variables:
 $ x : num  2.216 -0.181 1.697 1.655 1.068 ...
 $ x2: num  4.77 4.1 2.33 2.46 1.05 ...
 $ x3: num  -4.87 6.98 3.92 0.75 3.35 ...
 $ y : num  0 1 0 0 1 1 1 1 1 1 ...
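The summary() function adapts to the type of each column, and its result can be used in later calculations. The sketch below uses a made-up data frame (df and its group column are hypothetical) with one numeric column and one factor column.

```r
set.seed(123)

# Hypothetical data: one numeric column and one categorical column
df <- data.frame(x = rnorm(20),
                 group = factor(sample(c("a", "b"), 20, replace = TRUE)))

# Numeric columns get quartiles and the mean; factors get counts per level
print(summary(df))

# str() shows the type and the first values of each column
str(df)

# The summary of a single numeric column is a named vector you can index
s <- summary(df$x)
median_from_summary <- s[["Median"]]

print(names(s))
print(median_from_summary)
```

Indexing the summary by name is handy when you need just one statistic, although calling median() or quantile() directly gives the same number.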


Conclusion 


In this chapter, you looked into R programming. You now understand 
descriptive statistics. Descriptive statistics summarizes the data and 
usually focuses on the distribution, the central tendency, and the 
dispersion of the data. 

You also looked into how R programming allows you to import a data 
set that can be a CSV file, Excel file, tab-separated file, JSON file, and 
others. Reading data into the R console or R is important because you must 
have some data before you can do statistical computing and understand 
the data. 

You performed simple data processing like selecting data, sorting data, 
filtering data, getting unique values, and removing missing values. 

You also learned some basic statistics terms such as population,
sample, observations, and variables. You also learned about data types in
statistics, which include numeric data and categorical data. Numeric
data can be discrete or continuous, and categorical data can be
nominal or ordinal.

You also learned about mean, median, and mode, which are the most
common measures of central tendency. Central tendency is a measure
that best summarizes the data and is related to the center of the data set.

You also learned about measures of variability, which measure
the spread of the data. Measures of variability can be range, interquartile
range, variance, standard deviation, and others.

You also learned about the normal distribution, which is one of the more
important concepts because many statistical tests assume that the data
is distributed normally. You learned how to test for a normal distribution
in data and how to measure the skewness and kurtosis of distributions.

You also learned about the summary() and str() functions, which are
the fastest ways to get the descriptive statistics of the data. The summary()
function gives the basic descriptive statistics of the data. The str() function,
as mentioned in a previous chapter, gives the structure of the variables.

You learned more about two very popular distributions. The normal
distribution is also known as the bell curve and is a distribution that
happens naturally in many data sets. The binomial distribution has two
outcomes, success or failure, and can be thought of as the probability of
success or failure in a survey that is repeated various times.


Data Visualizations


Descriptive statistics is a set of mathematical techniques used to
summarize data. Data visualization is the equivalent in visual
communication, because creating graphics from data helps us understand the
data. Humans distinguish differences in line, shape, and color without much
processing effort, and data visualization can take advantage of this to
create charts and graphs to help us understand the data more easily.

In this chapter, you will plot a bar chart, histogram, line chart, pie 
chart, scatterplot, boxplot, and scatterplot matrix using R. You will then 
look into plotting decision trees and social network analysis graphs. You 
will also use ggplot2 to create more advanced charts using the grammar 
of graphics, and then you will look into creating interactive charts using 
Plotly JS. 


What Are Data Visualizations? 


Descriptive statistics summarizes the data and usually focuses on the 
distribution, the central tendency, and the dispersion of the data. Data 
visualization, on the other hand, creates graphics from the data to help 
us understand it. The graphics or charts can also help us to communicate 
visually to our clients. A picture is worth a thousand words. 

Data visualization can involve the plotting of a bar chart, histogram,
scatterplot, boxplot, line chart, time series, and scatterplot matrix chart to
help us analyze and reason about the data and understand the causality
and relationship between variables. Data visualization can also be viewed
as descriptive statistics to some. Humans distinguish differences in line,
shape, and color without much processing effort, and data visualization
can take advantage of this to create charts and graphs to help users
understand the data more easily.


Bar Chart and Histogram 


R programming allows us to plot bar charts and histograms. A bar chart 
represents data using bars, with y values being the value of the variable. R 
programming uses the barplot() function to create bar charts, and R can 
draw both horizontal and vertical bar charts. A histogram, on the other 
hand, represents the frequencies of the values within a variable and draws 
them into bars. 

To plot a bar chart in R, you can use the barplot() function: 


> data <- c(4, 6, 7, 9, 10, 20, 12, 8);
> barplot(data, xlab="x-axis", ylab="y-axis", main="bar chart 1",
col="green");


data is the data to plot, xlab is the x-axis name, ylab is the y-axis name, 
main is the main title, and col is the color of the chart. See Figure 5-1. 
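When the bars stand for named categories, a named vector gives each bar a label automatically. The sketch below is an illustration under stated assumptions: the counts and their names are made up, and the chart is drawn to a temporary PDF file so that the code also runs outside an interactive session.

```r
# Hypothetical named data: the names become the bar labels
counts <- c(apple = 4, banana = 6, cherry = 7)

# Draw to a temporary PDF so the example runs without a screen device
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

# barplot() returns the x position of the middle of each bar
mids <- barplot(counts, xlab = "fruit", ylab = "count",
                main = "bar chart", col = "green")

# Use those positions to print each value just inside the top of its bar
text(mids, counts, labels = counts, pos = 1)

invisible(dev.off())

print(mids)
```

In an interactive session you can leave out the pdf() and dev.off() lines, and the chart is drawn in the plot window instead.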


Figure 5-1. Bar Chart of data


To export the bar chart into an image file (see Figure 5-2), you can
add in


> data <- c(4, 6, 7, 9, 10, 20, 12, 8); 
> png(file="D:/barchart.png") ; 
> barplot(data, xlab="x-axis", ylab="y-axis", main="bar chart 1", 
col="green") ; 
> dev.off(); 
RStudioGD 
2 
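
The png() and dev.off() pair can be wrapped so that the device is closed
even when the plotting code fails. The helper name save_plot and the
temporary output path are assumptions made for this sketch and are not
part of base R. The usage line passes pdf() only because that device is
available on every R build; the default device of the helper is png().

```r
# Hypothetical helper: open a graphics device, draw, and always close it.
save_plot <- function(path, draw, device = grDevices::png) {
  device(path)                                # open the file device
  on.exit(grDevices::dev.off(), add = TRUE)   # close it even if draw() fails
  draw()
  invisible(path)
}

data <- c(4, 6, 7, 9, 10, 20, 12, 8)
out <- file.path(tempdir(), "barchart.pdf")   # temporary, portable location

save_plot(out, function() {
  barplot(data, xlab = "x-axis", ylab = "y-axis", main = "bar chart 1",
          col = "green")
}, device = grDevices::pdf)

file.exists(out)
```

Writing to tempdir() instead of a fixed drive letter keeps the example
independent of the operating system.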


Figure 5-2. Bar Chart PNG file Opened with Microsoft Photos


To plot a horizontal bar chart (see Figure 5-3), you can use horiz=TRUE: 


> data <- c(4, 6, 7, 9, 10, 20, 12, 8);
> png(file="D:/barchart.png");
> barplot(data, xlab="x-axis", ylab="y-axis", main="bar chart 1",
col="green", horiz=TRUE);
> dev.off();
RStudioGD 
2 


Figure 5-3. Horizontal Bar Chart of data variable


To plot a stacked bar plot, you can create the following data set: 


> data(mtcars);
> data <- table(mtcars$gear, mtcars$carb);
> data;

    1 2 3 4 6 8
  3 3 4 3 5 0 0
  4 4 4 0 4 0 0
  5 0 2 0 1 1 1


To plot a stacked bar plot:


> png(file="D:/barchart.png");
> barplot(data, xlab="x-axis", ylab="y-axis", main="bar chart 1",
col=c("grey", "blue", "yellow"));
> dev.off();
RStudioGD 
2 


In the data, the rows are the number of gears and the columns are the
number of carburetors. Row 3 is drawn in grey, row 4 in blue, and row 5
in yellow. When x is 1, the grey segment is 3 units high, the blue
segment is 4 units high, and the yellow segment is 0 units high. See
Figure 5-4.
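
The segment heights described above can be read straight from the
table, and the total height of each stacked bar is the column sum. This
is a minimal sketch; the variable names counts and totals and the axis
labels are chosen here for illustration.

```r
counts <- table(mtcars$gear, mtcars$carb)   # rows: gears, columns: carburetors

# Height of each segment in the first stack (x = 1): one value per gear.
counts[, "1"]

# Total height of each stacked bar is the column sum.
totals <- colSums(counts)
totals

# legend.text = TRUE labels the colors with the row names (3, 4, 5).
barplot(counts, xlab = "carb", ylab = "count", main = "gears by carburetors",
        col = c("grey", "blue", "yellow"), legend.text = TRUE)
```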


> data;

    1 2 3 4 6 8
  3 3 4 3 5 0 0
  4 4 4 0 4 0 0
  5 0 2 0 1 1 1


Figure 5-4. Stacked bar plot of data

To plot a grouped bar chart, you can use beside=TRUE:


> data(mtcars);
> data <- table(mtcars$gear, mtcars$carb);
> data;

    1 2 3 4 6 8
  3 3 4 3 5 0 0
  4 4 4 0 4 0 0
  5 0 2 0 1 1 1
> png(file="D:/barchart.png");
> barplot(data, xlab="x-axis", ylab="y-axis", main="bar chart 1",
col=c("grey", "blue", "yellow"), beside=TRUE);
> dev.off();
RStudioGD 
2 


In the data, row 3 is drawn in grey, row 4 in blue, and row 5 in yellow,
and the bars for each x value are drawn side by side. When x is 1, the
grey bar is 3 units high, the blue bar is 4 units high, and the yellow
bar is 0 units high. See Figure 5-5.


> data;

    1 2 3 4 6 8
  3 3 4 3 5 0 0
  4 4 4 0 4 0 0
  5 0 2 0 1 1 1


Figure 5-5. Grouped bar chart of data


To plot a histogram, you can use the hist() function (see Figure 5-6): 


> set.seed(123);
> data1 <- rnorm(100, mean=5, sd=3);
> png(file="D:/histogram.png");
> hist(data1, main="histogram", xlab="x-axis", col="green",
border="blue", breaks=10);
> dev.off();
RStudioGD
2


Figure 5-6. Histogram of data1


data1 is the data, main is the main title, col is the color, border is the
color of the borders, xlab is the name of the x-axis, and breaks, when
given as a single number, is the suggested number of bars (cells); R may
adjust it to get round break points.
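
A single number passed to breaks is treated by hist() as a suggestion
for the number of cells, not as a bar width. The sketch below inspects
the break points R actually chooses and then passes an explicit vector
of boundaries; the names h, h2, and edges are illustrative.

```r
set.seed(123)
data1 <- rnorm(100, mean = 5, sd = 3)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(data1, breaks = 10, plot = FALSE)

h$breaks           # cell boundaries actually chosen by R
length(h$counts)   # number of cells, which may differ from the 10 requested
sum(h$counts)      # every observation falls in exactly one cell

# An explicit vector of boundaries is used exactly as given.
# The upper end is padded so the last boundary always covers the maximum.
edges <- seq(floor(min(data1)), ceiling(max(data1)) + 1, by = 2)
h2 <- hist(data1, breaks = edges, plot = FALSE)
h2$breaks
```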

To plot a histogram with a density line, you can set freq=FALSE
so that the histogram is plotted based on probability density, and you
can use the lines() function to add the density line (see Figure 5-7):


> set.seed(123);
> data1 <- rnorm(100, mean=5, sd=3);
> hist(data1, main="histogram", xlab="x-axis", col="green",
border="blue", breaks=10, freq=FALSE);
> lines(density(data1), col="red");


Figure 5-7. Histogram with density line of data1


Line Chart and Pie Chart 


A line chart is a graph that has all the points connected together by 
drawing lines between them. A line chart is very useful for trend 
analysis and time series analysis. A pie chart, on the other hand, is the 
representation of data as slices of a circle with various colors. 


You can plot a line graph using the plot() function (see Figure 5-8):


> x <- c(1, 2, 3, 4, 5, 6, 8, 9);
> y <- c(3, 5, 4, 6, 9, 8, 2, 1);
> png(file="D:/line.png");
> plot(x, y, type="l", xlab="x-axis", ylab="y-axis", main="line graph",
col="blue");
> dev.off();
RStudioGD
2


Figure 5-8. Line chart


You use type="l" when you want to plot a line chart and type="p" when
you want to plot a point chart or scatter chart. xlab is the x-axis name, ylab
is the y-axis name, main is the main title, and col is the color of the chart.

To plot multiple lines on one graph, you can add the lines() function:


> x <- c(1, 2, 3, 4, 5, 6, 8, 9);
> y <- c(3, 5, 4, 6, 9, 8, 2, 1);
> x.1 <- c(2, 3, 4, 6, 7, 8, 9, 10);
> y.1 <- c(6, 3, 5, 1, 5, 3, 4, 8);
> png(file="D:/line.png");
> plot(x, y, type="l", xlab="x-axis", ylab="y-axis", main="line graph",
col="blue");
> lines(x.1, y.1, type="o", col="green");
> dev.off();
RStudioGD
2


type="o" draws a line with a point at each data value (see Figure 5-9).


Figure 5-9. Multiple Line Chart
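
plot() sets the axis limits from the first series only, so a point of
the second series that lies outside that range (here x.1 = 10) falls
outside the plotting region. A minimal sketch of one way to handle
this is to compute limits that cover both series with range(); the
legend and the names x_range and y_range are additions made for
illustration.

```r
x   <- c(1, 2, 3, 4, 5, 6, 8, 9)
y   <- c(3, 5, 4, 6, 9, 8, 2, 1)
x.1 <- c(2, 3, 4, 6, 7, 8, 9, 10)
y.1 <- c(6, 3, 5, 1, 5, 3, 4, 8)

# Axis limits that cover both series.
x_range <- range(x, x.1)
y_range <- range(y, y.1)

plot(x, y, type = "l", xlab = "x-axis", ylab = "y-axis", main = "line graph",
     col = "blue", xlim = x_range, ylim = y_range)
lines(x.1, y.1, type = "o", col = "green")
legend("topleft", legend = c("y", "y.1"), col = c("blue", "green"), lty = 1)
```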


To plot a pie chart, you can use the pie() function (see Figure 5-10): 


> x <- c(10, 30, 60, 10, 50);
> labels <- c("one", "two", "three", "four", "five");
> png(file="D:/pie.png");
> pie(x, labels, main="Pie Chart");
> dev.off();
RStudioGD
2


Figure 5-10. Pie Chart
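
pie() sizes each slice in proportion to its share of the total, so it
can help to print that share in the label. The sketch below computes the
percentages by hand; the label format and the names pct and slice_labels
are choices made for this example.

```r
x <- c(10, 30, 60, 10, 50)
labels <- c("one", "two", "three", "four", "five")

# Share of each slice as a percentage of the total.
pct <- 100 * x / sum(x)
slice_labels <- sprintf("%s (%.1f%%)", labels, pct)

pie(x, labels = slice_labels, main = "Pie Chart")
```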


To plot a 3D pie chart, you must install the plotrix library: 


> install.packages("plotrix");

Installing package into 'C:/Users/gohmi/Documents/R/win-library/3.5'
(as 'lib' is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/plotrix_3.7-3.zip'
Content type 'application/zip' length 1055537 bytes (1.0 MB)
downloaded 1.0 MB

package 'plotrix' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\gohmi\AppData\Local\Temp\RtmpET26Hv\downloaded_packages


You can use the require() function or the library() function to load
the plotrix package:


> library(plotrix) ; 


You can use the pie3D() function to plot the 3D pie chart shown in 
Figure 5-11: 


> x <- c(10, 30, 60, 10, 50);
> labels <- c("one", "two", "three", "four", "five");
> png(file="D:/pie.png");
> pie3D(x, labels=labels, explode=0.1, main="Pie Chart");
> dev.off();
RStudioGD
2


Figure 5-11. 3D Pie Chart


Scatterplot and Boxplot 


A scatterplot is a chart that represents data using points in the Cartesian
plane. Each point represents the values of two variables. A boxplot shows
summary statistics of the data: the minimum, maximum, median, and
quartiles.


You can plot a scatterplot using the plot() function: 


> x <- c(1, 2, 3, 4, 5, 6, 8, 9);
> y <- c(3, 5, 4, 6, 9, 8, 2, 1); 
> png(file="D:/scatter.png"); 
> plot(x, y, xlab="x-axis", ylab="y-axis", main="scatterplot"); 
> dev.off(); 
RStudioGD 
2 


xlab is the x-axis name, ylab is the y-axis name, and main is the main 
title. See Figure 5-12. 


Figure 5-12. Scatterplot

A boxplot represents how the data in a data set is distributed,
depicting the minimum, maximum, median, first quartile, and third
quartile. Figure 5-13 explains the boxplot.


Figure 5-13. Boxplot and its explanation (diagram labels, top to bottom:
Maximum, Third Quartile, Median, First Quartile, Minimum; the IQR spans
the box between the first and third quartiles)
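
The quantities in the diagram can be computed directly. One detail worth
knowing is that, by default, R draws each whisker only as far as the
most extreme data point within 1.5 times the box length, so the whisker
ends equal the minimum and maximum only when there are no outliers. The
sample below is simulated, and the variable names are illustrative.

```r
set.seed(12)
values <- rnorm(100, mean = 3, sd = 3)

# Minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(values)
five

# Interquartile range: third quartile minus first quartile.
iqr <- IQR(values)
iqr

# What boxplot() actually draws: whisker ends, hinges, and median.
bp <- boxplot.stats(values)
bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # points beyond the whiskers, drawn individually
```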


To create the boxplot in Figure 5-14, you can use the boxplot() function: 


> set.seed(12);
> var1 <- rnorm(100, mean=3, sd=3);
> var2 <- rnorm(100, mean=2, sd=2);
> var3 <- rnorm(100, mean=1, sd=3);
> data <- data.frame(var1, var2, var3);
> png(file="D:/boxplot.png");
> boxplot(data, main="boxplot", notch=FALSE, varwidth=TRUE,
col=c("green", "purple", "blue"));
> dev.off();
RStudioGD
2


Figure 5-14. Boxplots


data is the data, main is the main title, notch is a logical value that
states whether to draw a notch on each side of the boxes (if the notches
of two boxes do not overlap, that is strong evidence that their medians
differ), varwidth is a logical value that states whether to draw the
width of each box in proportion to the square root of the sample size,
and col is the color of the boxplot.

You can draw a boxplot with a notch by setting notch=TRUE (see 
Figure 5-15): 


> set.seed(12);
> var1 <- rnorm(100, mean=3, sd=3);
> var2 <- rnorm(100, mean=2, sd=2);
> var3 <- rnorm(100, mean=1, sd=3);
> data <- data.frame(var1, var2, var3);
> png(file="D:/boxplot.png");
> boxplot(data, main="boxplot", notch=TRUE, varwidth=TRUE,
col=c("green", "purple", "blue"));
> dev.off();
RStudioGD
2


Figure 5-15. Boxplots with notch


Scatterplot Matrix


A scatterplot matrix shows a scatterplot for every pair of variables. It
is used to find the correlation between a variable and the other
variables, and you usually use it to select the important variables,
which is also known as variable selection.

To plot the scatterplot matrix in Figure 5-16, you can use the pairs()
function:


> set.seed(12);
> var1 <- rnorm(100, mean=1, sd=3);
> var2 <- rnorm(100, mean=1, sd=3);
> var3 <- rnorm(100, mean=1, sd=3);
> var4 <- rnorm(100, mean=2, sd=3);
> var5 <- rnorm(100, mean=2, sd=3);
> data <- data.frame(var1, var2, var3, var4, var5);
> png(file="D:/scatterplotMatrix.png");
> pairs(~var1+var2+var3+var4+var5, data=data, main="scatterplot matrix");
> dev.off();
RStudioGD
2


Figure 5-16. Scatterplot Matrix
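
The correlations that a scatterplot matrix lets you judge by eye can
also be computed with cor(). This is a minimal sketch on the same kind
of simulated data; because the variables are independent random draws,
the off-diagonal correlations should be close to zero. The name corr is
illustrative.

```r
set.seed(12)
data <- data.frame(var1 = rnorm(100, mean = 1, sd = 3),
                   var2 = rnorm(100, mean = 1, sd = 3),
                   var3 = rnorm(100, mean = 1, sd = 3),
                   var4 = rnorm(100, mean = 2, sd = 3),
                   var5 = rnorm(100, mean = 2, sd = 3))

# Pearson correlation for every pair of columns.
corr <- cor(data)
round(corr, 2)

# Correlation of each variable with var1, strongest first (var1 itself dropped).
sort(abs(corr["var1", -1]), decreasing = TRUE)
```

For variable selection, the same ranking would be applied to the row of
the target variable in a real data set.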


Social Network Analysis Graph Basics 


A social network analysis graph is an advanced data visualization; it is
not directly related to statistics, but it can be good for reference. The
following is a quick introduction to social network analysis graphs. A
social network analysis graph can help you understand the relationships
between individuals or nodes. Social network analysis is usually used on
social network data like Facebook and Weibo. Each node is an individual,
and the social network graphs show us how each individual connects to
others.

To plot a social network analysis graph, you must install the igraph
package:


> install.packages("igraph");

Installing package into 'C:/Users/gohmi/Documents/R/win-library/3.5'
(as 'lib' is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/igraph_1.2.2.zip'
Content type 'application/zip' length 9148505 bytes (8.7 MB)
downloaded 8.7 MB

package 'igraph' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\gohmi\AppData\Local\Temp\RtmpET26Hv\downloaded_packages


To use the igraph package, you must use the library() function:

> library(igraph);

Attaching package: 'igraph'

The following objects are masked from 'package:stats':

    decompose, spectrum

The following object is masked from 'package:base':

    union


To plot the network graph for John > James > Mary > John shown in
Figure 5-17, you use


> g <- graph(edges=c("John", "James", "James", "Mary", "Mary",
"John"), directed=FALSE);
> plot(g);


Figure 5-17. Social Network Analysis with Undirected Network Graph
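
The edges vector is read in consecutive pairs: the first two names form
the first edge, the next two the second edge, and so on. The base R
sketch below, which does not need igraph, makes that pairing explicit
and builds the adjacency matrix of the undirected graph. The names
edge_list, nodes, and adj are illustrative.

```r
edges <- c("John", "James", "James", "Mary", "Mary", "John")

# Each row is one edge: column 1 is the start node, column 2 the end node.
edge_list <- matrix(edges, ncol = 2, byrow = TRUE,
                    dimnames = list(NULL, c("from", "to")))
edge_list

# Adjacency matrix of the undirected graph.
nodes <- unique(edges)
adj <- matrix(0, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
adj[edge_list] <- 1                       # from -> to
adj[edge_list[, c("to", "from")]] <- 1    # to -> from, since edges are undirected
adj
```

For a directed graph, only the first of the two assignments would be
kept, which leaves the matrix asymmetric.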


To plot the directed network graph shown in Figure 5-18, you change
directed=FALSE to directed=TRUE:


> g <- graph(edges=c("John", "James", "James", "Mary", "Mary",
"John"), directed=TRUE);
> plot(g);


Figure 5-18. Social Network Analysis with directed network graph


Using ggplot2


ggplot2 is a package created by Hadley Wickham that offers a powerful
graphics language for creating advanced charts. ggplot2 is very popular
in the R community, and it allows us to create charts that represent
univariate, multivariate, and categorical data in a straightforward
way. R's built-in functions can plot charts, but ggplot2 allows us to
plot more advanced charts using the grammar of graphics.

To use ggplot2, you must install the package: 


> install.packages("ggplot2");

Installing package into 'C:/Users/gohmi/Documents/R/win-library/3.5'
(as 'lib' is unspecified)
also installing the dependencies 'stringi', 'colorspace',
'stringr', 'labeling', 'munsell', 'RColorBrewer', 'digest',
'gtable', 'lazyeval', 'plyr', 'reshape2', 'scales',
'viridisLite', 'withr'

  There is a binary version available but the source version is later:
        binary source needs_compilation
stringi  1.1.7  1.2.4              TRUE

  Binaries will be installed
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/stringi_1.1.7.zip'
Content type 'application/zip' length 14368013 bytes (13.7 MB)
downloaded 13.7 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/colorspace_1.3-2.zip'
Content type 'application/zip' length 527776 bytes (515 KB)
downloaded 515 KB

The downloaded binary packages are in
C:\Users\gohmi\AppData\Local\Temp\RtmpET26Hv\downloaded_packages


What Is the Grammar of Graphics? 


ggplot2 focuses on the grammar of graphics, which is the set of building
blocks of the chart, such as

• Data
• Aesthetic mapping
• Geometric objects
• Statistical transformation
• Scales
• Coordinate systems
• Position adjustments
• Faceting

In this book, you are going to look into some of them. It is beyond this
book to cover everything in ggplot2.


The Setup for ggplot2 


To use ggplot2, you must load the package using the library() or require()
functions. You also need to let ggplot2 know which data set to use, and you
can do so with the ggplot() function:


> library(ggplot2);
> set.seed(12);
> var1 <- rnorm(100, mean=1, sd=1);
> var2 <- rnorm(100, mean=2, sd=1);
> var3 <- rnorm(100, mean=1, sd=2);
> data <- data.frame(var1, var2, var3);
> ggplot(data);


The data argument of ggplot() must be a data frame, or an object that
ggplot2 can convert to one.


Aesthetic Mapping in ggplot2 


In ggplot2, aesthetics are the things we can see, such as

• Position
• Color
• Fill
• Shape
• Line type
• Size

You can use aesthetics in ggplot2 via the aes() function:


> ggplot(data, aes(x=var1, y=var2));


Geometry in ggplot2 


Geometric objects are the plots or graphs you want to put in the chart. You
can use geom_point() to create a scatterplot, geom_line() to create a line
plot, and geom_boxplot() to create a boxplot in the chart.

You can see the available geometric objects (also shown in Figure 5-19)
using


> help.search("geom_", package="ggplot2");


Figure 5-19. Available Geometric Objects


In ggplot2, each geom is also a layer of the chart. You can add one
geom object after another, just like adding one layer after another.
You can add a scatterplot (Figure 5-20) using the geom_point()
function:


> ggplot(data, aes(x=var1, y=var2)) + geom_point(aes(color="red"));


Because color="red" is placed inside aes(), ggplot2 treats "red" as a
data value and adds a legend entry named red; to simply draw the points
in red, set color="red" outside aes().


Figure 5-20. Scatterplot
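
Because every + returns a new plot object, the layers can also be added
one at a time to a stored plot. The sketch below assumes ggplot2 is
installed; it sets color outside aes(), so the points are drawn in red
without a legend, and it names the smoothing method and formula
explicitly instead of relying on the defaults. The variable name p is
illustrative.

```r
library(ggplot2)

set.seed(12)
data <- data.frame(var1 = rnorm(100, mean = 1, sd = 1),
                   var2 = rnorm(100, mean = 2, sd = 1))

# Start with the data and the aesthetic mapping only.
p <- ggplot(data, aes(x = var1, y = var2))

# Add layers one at a time; each + returns a new plot object.
p <- p + geom_point(color = "red")                         # layer 1: points
p <- p + geom_smooth(method = "loess", formula = y ~ x)    # layer 2: smoother

length(p$layers)   # number of layers added so far
print(p)
```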


You can add in a smoother that includes a line and a ribbon to the
scatter plot (Figure 5-21) using another layer:


> ggplot(data, aes(x=var1, y=var2)) + geom_point(aes(color="red")) +
geom_smooth();

`geom_smooth()` using method = 'loess' and formula 'y ~ x'


Figure 5-21. Scatterplot with Smoother


Labels in ggplot2 


You have plotted the graphs in the charts, so now let's add in the main title 
and x- and y-axis titles. You can do this using the labs() layer to specify 
the labels. 

To add in the x-axis title, y-axis title, and the main title into Figure 5-22, 
you can use the labs() function: 


> ggplot(data, aes(x=var1, y=var2)) + geom_point(aes(color="red")) +
geom_smooth() + labs(title="Scatter", x = "X-axis", y = "Y-axis",
color="Color");

`geom_smooth()` using method = 'loess' and formula 'y ~ x'


Figure 5-22. Scatterplot with Smoother and labels


Themes in ggplot2 


If you want to change the size and style of the labels and legends, the
theme() function can help. The theme() function in ggplot2 handles
elements such as

• Axis labels
• Plot background
• Facet background
• Legend appearance

There are some built-in themes such as theme_light() and theme_bw().
You can add the built-in theme shown in Figure 5-23 by using
theme_light():


> ggplot(data, aes(x=var1, y=var2)) + geom_point(aes(color="red")) +
geom_smooth() + labs(title="scatter", x="x-axis", y="y-axis") +
theme_light();

`geom_smooth()` using method = 'loess' and formula 'y ~ x'


Figure 5-23. Scatterplot with smoothers and labels and themes


You can add in your own theme, as shown in Figure 5-24, using


> ggplot(data, aes(x=var1, y=var2)) + geom_point(aes(color="red")) +
geom_smooth() + labs(title="scatter", x="x-axis", y="y-axis") +
theme(plot.title=element_text(size=30, face="bold"),
axis.text.x=element_text(size=15, face="bold"),
axis.text.y=element_text(size=15, face="bold"));

`geom_smooth()` using method = 'loess' and formula 'y ~ x'


plot.title is the title of the chart, axis.text.x is the x-axis tick
labels, axis.text.y is the y-axis tick labels, axis.title.x is the title
of the x-axis, and axis.title.y is the title of the y-axis. To change the
size of the text, use the element_text() function. To remove the label,
you can use element_blank().


Figure 5-24. Scatterplot with Smoother and customized themes


ggplot2 Common Charts 


After learning the ggplot basics and the grammar of graphics, the following 
sections cover some common charts that can be plotted with ggplot. The 
code is available also. 


Bar Chart 


A bar chart (Figure 5-25) is used when you want to compare things 
between groups: 


> d <- c(1, 5, 8, 9, 8, 2, 1);
> df <- data.frame(d);
> ggplot(df) + geom_bar(aes(color="grey", x=d)) +
labs(title="bar chart") + theme_light();


Figure 5-25. Bar Chart

To plot a horizontal bar chart (Figure 5-26), you can use the
coord_flip() function:


> ggplot(df) + geom_bar(aes(color="grey", x=d)) + coord_flip() +
labs(title="bar chart") + theme_light();


Figure 5-26. Horizontal Bar Chart


Histogram


A histogram (Figure 5-27) helps you judge whether the data is
approximately normally distributed.


> set.seed(12);
> var1 <- rnorm(100, mean=1, sd=1);
> var2 <- rnorm(100, mean=2, sd=1);
> var3 <- rnorm(100, mean=1, sd=2);
> data <- data.frame(var1, var2, var3);
> ggplot(data, aes(x=var1)) + geom_histogram(bins=10,
color="black", fill="grey") + labs(title="histogram") +
theme_light();


Figure 5-27. Histogram


Density Plot


A density plot (Figure 5-28) can also show you whether the data is
normally distributed.


> ggplot(data, aes(x=var1)) + geom_density(fill="grey") +
labs(title="density");


Figure 5-28. Density Plot


Scatterplot 


A scatterplot (Figure 5-29) shows the relationships between two variables.


> ggplot(data) + geom_point(aes(color="red", x=var1, y=var2)) +
geom_point(aes(color="green", x=var1, y=var3)) +
labs(title="scatter") + theme_light();


Figure 5-29. Scatterplot
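
An alternative to adding one geom_point() layer per series is to
reshape the data to long format and map color to the series name, so
that ggplot2 builds the legend from the data. This is a sketch under the
assumption that ggplot2 is installed; stack() is base R, and the column
names values and ind are the ones stack() produces. The color choices
and the legend title are illustrative.

```r
library(ggplot2)

set.seed(12)
data <- data.frame(var1 = rnorm(100, mean = 1, sd = 1),
                   var2 = rnorm(100, mean = 2, sd = 1),
                   var3 = rnorm(100, mean = 1, sd = 2))

# Stack var2 and var3 into one column, with a second column naming the source.
long <- stack(data[c("var2", "var3")])   # columns: values, ind
long$var1 <- rep(data$var1, times = 2)   # repeat x once per stacked series

p <- ggplot(long, aes(x = var1, y = values, color = ind)) +
  geom_point() +
  scale_color_manual(values = c(var2 = "red", var3 = "green")) +
  labs(title = "scatter", y = "value", color = "series") +
  theme_light()
print(p)
```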


Line Chart


A line chart (Figure 5-30) also shows the relationship between two
variables and can also be used for trend analysis.


> ggplot(data) + geom_line(aes(color="red", x=var1, y=var2)) +
geom_line(aes(color="green", x=var1, y=var3)) +
labs(title="scatter") + theme_light();


Figure 5-30. Line Chart


Boxplot


A boxplot (Figure 5-31) shows the statistics of the data.


> ggplot(data, aes(y=var2)) + geom_boxplot(fill="grey") +
labs(title="boxplot");


Figure 5-31. Boxplot


You can change the grids (Figure 5-32) using


> ggplot(data, aes(x=var1)) + geom_density(fill="grey") +
labs(title="density") +
theme(panel.background=element_rect(fill="yellow"),
panel.grid.major=element_line(color="blue", size=2));


Figure 5-32. Customised Themes


You can change the background color using panel.background
and element_rect(), and you can change the grid using
panel.grid.major, panel.grid.minor, and element_line().


To save a chart, you can use the ggsave() function (Figure 5-33):


> ggsave("D:/density.png");
Saving 3.95 x 3.3 in image


Figure 5-33. Exported Chart to PNG file


Interactive Charts with Plotly and ggplot2 


Plotly JS allows you to create interactive, publication-quality charts. You
can create a Plotly chart using ggplot2. To use Plotly or to create a Plotly
chart, you must download the plotly package as follows:


> install.packages("plotly");

Installing package into 'C:/Users/gohmi/Documents/R/win-library/3.5'
(as 'lib' is unspecified)
also installing the dependencies 'httpuv', 'xtable',
'sourcetools', 'mime', 'openssl', 'yaml', 'shiny', 'later',
'httr', 'jsonlite', 'base64enc', 'htmltools', 'htmlwidgets',
'tidyr', 'hexbin', 'crosstalk', 'data.table', 'promises'

  There is a binary version available but the source version is later:
            binary source needs_compilation
htmlwidgets    1.2    1.3             FALSE

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/httpuv_1.4.5.zip'
Content type 'application/zip' length 1182084 bytes (1.1 MB)
downloaded 1.1 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/xtable_1.8-3.zip'
Content type 'application/zip' length 755944 bytes (738 KB)
downloaded 738 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/sourcetools_0.1.7.zip'
Content type 'application/zip' length 530521 bytes (518 KB)
downloaded 518 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/mime_0.5.zip'
Content type 'application/zip' length 46959 bytes (45 KB)
downloaded 45 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/openssl_1.0.2.zip'
Content type 'application/zip' length 3628608 bytes (3.5 MB)
downloaded 3.5 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/yaml_2.2.0.zip'
Content type 'application/zip' length 203553 bytes (198 KB)
downloaded 198 KB


To use Plotly, you must load the package using the library() or
require() functions:


> library(plotly);

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:igraph':

    groups

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout


To create a Plotly chart (Figure 5-34), you can use the ggplotly()
function:


> set.seed(12);
> var1 <- rnorm(100, mean=1, sd=1);
> var2 <- rnorm(100, mean=2, sd=1);
> var3 <- rnorm(100, mean=1, sd=2);
> data <- data.frame(var1, var2, var3);
> gg <- ggplot(data) + geom_line(aes(x=var1, y=var2));
> g <- ggplotly(gg);
> g;


Figure 5-34. Plotly Chart using R


To save the Plotly chart online, you must create a free Plotly account.
Then you can use the following code to set your credentials for the R
session:


> Sys.setenv("plotly_username"="your plotly username");
> Sys.setenv("plotly_api_key"="your api key");


To publish the graph, you use


> api_create(g, filename="");


Conclusion 


In this chapter, you looked into data visualization with R. You now
understand that descriptive statistics summarizes the data and usually
focuses on the distribution, the central tendency, and the dispersion of
the data. Data visualization, on the other hand, creates graphics from
the data to help us understand the data.

You learned how R programming allows you to plot line charts, bar
charts, histograms, scatterplots, scatterplot matrices, pie charts, and
boxplots.

You also learned about decision trees, which are machine learning 
algorithms that perform both classification and regression tasks. They can 
be a graph to represent choices and results using a tree. 

You also learned about social network analysis graphs, which can 
help us understand the relationships between individuals or nodes. Social 
network analysis is usually used on social network data like Facebook and 
Weibo. Each node is an individual, and the social network graphs show us 
how each individual connects to others. 

You also explored ggplot2, a package created by Hadley Wickham
that offers a powerful graphics language for creating advanced charts.
ggplot2 is very popular in the R community, and it allows us to
create charts that represent univariate, multivariate, and categorical
data in a straightforward way.

You also learned the grammar of graphics. ggplot2 focuses on the
grammar of graphics, which is the set of building blocks of a chart.

You also explored Plotly JS. Plotly JS allows us to create interactive,
publication-quality charts. You created a Plotly chart using ggplot2.


CHAPTER 6


Inferential Statistics 
and Regressions 


Inferential statistics and descriptive statistics are the main branches of 
statistics. Descriptive statistics derives a summary from the data set and 
makes use of central tendency, dispersion, and skewness. Inferential 
statistics describes and makes inferences about the population from 
the sampled data. In inferential statistics, you use hypothesis testing 
and estimating of parameters. Regression analysis is a set of statistical 
processes to estimate the relationships between all the variables. 

In this chapter, you will look into the apply(), lapply(), and
sapply() functions and then you'll sample data and perform correlations
and covariances plus tests such as the t-test, chi-square test, and
ANOVA, and learn how to read their p-values. You will then look into
non-parametric tests, which include the Wilcoxon signed rank test, the
Wilcoxon-Mann-Whitney test, and the Kruskal-Wallis test, followed by
simple linear regression and multiple linear regression analysis.


What Are Inferential Statistics 
and Regressions? 


Inferential statistics and descriptive statistics are the two main branches of
statistics. Descriptive statistics derives a summary from the data by using
central tendencies like mean and median, dispersions like variance and
standard deviation, and skewness and kurtosis.

Inferential statistics makes inferences about the population from the
sampled data. In inferential statistics, you use hypothesis testing and
estimating of parameters. By estimating parameters, you try to answer
questions about the population parameters. In hypothesis testing, you
try to answer a research question. A research question is a hypothesis
asked in question format. A research question can be, Is there a
significant difference between the grades of class 1 and class 2 for
their engineering math exams? A hypothesis can be, There is a
significant difference between the grades of class 1 and class 2 for
their engineering math exams. The research question begins with Is there
and the hypothesis begins with There is. Based on the research question,
you state a null hypothesis, $H_0$, and an alternate hypothesis, $H_a$.
The null hypothesis, $H_0$, can be $\mu_1 = \mu_2$, and the alternate
hypothesis, $H_a$, can be $\mu_1 \neq \mu_2$, where $\mu_1$ is the mean
of the grades of class 1 and $\mu_2$ is the mean of the grades of class
2. You can then use inference tests to get the p-value. If the p-value
is less than or equal to alpha, which is usually 0.05, you reject the
null hypothesis in favor of the alternate hypothesis at the 5%
significance level (a 95% confidence level). If the p-value is more than
0.05, you fail to reject the null hypothesis.
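
The decision rule above can be sketched with a two-sample t-test. The
grades below are simulated, and the class means, spreads, and sample
sizes are made-up values for illustration, not data from the book.

```r
set.seed(1)
# Simulated exam grades; the means and spreads are made up for illustration.
class1 <- rnorm(30, mean = 70, sd = 10)
class2 <- rnorm(30, mean = 75, sd = 10)

# H0: the two class means are equal; Ha: they differ (two-sided test).
result <- t.test(class1, class2)   # Welch two-sample t-test by default
result$p.value

# Compare the p-value with alpha to reach a decision.
alpha <- 0.05
if (result$p.value <= alpha) {
  decision <- "reject H0"
} else {
  decision <- "fail to reject H0"
}
decision
```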

For estimating parameters, the parameters can be the mean, variance,
standard deviation, and others. If you want to estimate the mean height
of the whole population (and it is usually impractical to measure
everyone in the population), you can use a sampling method to select
some people from the population. Subsequently, you calculate the mean
height of the sample and then make an inference about the mean height of
the population. You can then construct a confidence interval, which is a
range that is likely to contain the mean height of the population. You
construct a range because the sample cannot give the exact mean height
of the population.
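
A 95% confidence interval for a mean can be computed by hand from the
sample mean, the standard error, and a t quantile, and checked against
t.test(). The heights are simulated, and the population mean and
standard deviation used to generate them are assumptions for this
sketch.

```r
set.seed(2)
# Simulated sample of 50 heights in cm; the population values are assumptions.
heights <- rnorm(50, mean = 170, sd = 8)

n    <- length(heights)
xbar <- mean(heights)             # point estimate of the population mean
se   <- sd(heights) / sqrt(n)     # standard error of the mean

# 95% confidence interval based on the t distribution with n - 1 df.
t_crit <- qt(0.975, df = n - 1)
ci <- c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
ci

# t.test() reports the same interval.
t.test(heights, conf.level = 0.95)$conf.int
```
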

Regression analysis is a set of statistical processes to estimate the
relationships between variables. To be more specific, regression
analysis is used to understand the relationships among independent
variables and dependent variables and to explore the forms of the
relationships.


apply(), lapply(), sapply() 


The apply() function loops through the data and applies a function to
each row or column. The function can be a built-in function such as
mean() or a customized function. The purpose of apply() is to avoid
writing explicit loops. The apply() function takes a matrix or an array;
a data frame is converted to a matrix first.

To use the apply() function, you create random data: 


> set.seed(123);
> var1 <- rnorm(100, mean=2, sd=1);
> var2 <- rnorm(100, mean=3, sd=1);
> var3 <- rnorm(100, mean=3, sd=2);
> data <- data.frame(var1, var2, var3);
> data;

         var1      var2       var3
1  1.43952435 2.2895934 7.39762070
2  1.76982251 3.2568837 5.62482595
3  3.55870831 2.7533081 2.46970989
4  2.07050839 2.6524574 4.08638812
5  2.12928774 2.0483814 2.17132010
6  3.71506499 2.9549723 2.04750621
7  2.46091621 2.2150955 1.42279432
8  0.73493877 1.3320581 1.81076547
9  1.31314715 2.6197735 6.30181493
10 1.55433803 3.9189966 2.89194375


You can use the apply() function with a mean() function:


> apply(data, 1, mean);
  [1] 3.708913 3.550511 2.927242 2.936451 2.116330 2.905848 2.032935 1.292587 3.411579 2.788426
 ...


The output above is the mean of every row (only the first ten row means
are shown here). data is the data, 1 is the margin, and mean is the
function. The margin determines whether the function is applied to each
row (when it is 1) or to each column (when it is 2).


You can get the mean of each column by using margin=2: 


> apply(data, 2, mean); 


    var1     var2     var3 
2.090406 2.892453 3.240930 
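
The text notes that the function passed to apply() can be one you write yourself. As a minimal sketch, the following uses a small matrix and a hypothetical helper named value_range() (neither is part of the original example) to show both margins:

```r
# A small 2 x 3 matrix so the results are easy to check by hand
m <- matrix(1:6, nrow = 2)

# Hypothetical custom function: spread between the largest and smallest value
value_range <- function(x) max(x) - min(x)

row_ranges <- apply(m, 1, value_range)  # margin 1: one result per row
col_ranges <- apply(m, 2, value_range)  # margin 2: one result per column

print(row_ranges)
print(col_ranges)
```

With margin 1 the result has one entry per row, and with margin 2 it has one entry per column, which mirrors the two apply() calls on data above.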
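apply() and lapply() are very similar, except that lapply() works on each 
element of a list or vector and returns its output as a list. You can use 
lapply() as follows: 


> lapply(data$var1, mean); 


[[1]] 
[1] 1.439524 

[[2]] 
[1] 1.769823 

[[3]] 
[1] 3.558708 

[[4]] 
[1] 2.070508 

[[5]] 
[1] 2.129288 

... 


Because data$var1 is a numeric vector, lapply() treats each value as a 
separate element, so each mean is simply the value itself. 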


sapply() is very similar to lapply(), except that the output is simplified 
to a vector: 


> sapply(data$var1, mean); 


  [1] 1.43952435 1.76982251 3.55870831 2.07050839 2.12928774 3.71506499 2.46091621 
  [8] 0.73493877 1.31314715 1.55433803 3.22408180 2.35981383 2.40077145 2.11068272 
... 


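The difference between lapply() and sapply() is easier to see when each element holds several values. The following is a small sketch with a made-up list called scores (an assumption for illustration, not data from this chapter):

```r
# Hypothetical list: each element is a vector of a different length
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

means_list <- lapply(scores, mean)  # returns a list with one mean per element
means_vec  <- sapply(scores, mean)  # returns a named numeric vector

str(means_list)
print(means_vec)
```

Both calls compute the same three means; only the container of the result differs.

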
Sampling 


Sampling is the selection of a subset of a population. The population is the 
complete set of members you are interested in, and a sample is a subset 
drawn from that full data set. The advantages of sampling are that the cost 
is lower and data collection is more efficient than collecting the data from 
every member of the population. 




Simple Random Sampling 


Simple random sampling (SRS) selects elements from the full data set 
randomly. Each element has the same probability of being selected, and 
each subset has the same probability of being selected as other subsets. 


Stratified Sampling 


Stratified sampling is when you divide the population into groups based on 
one or more factors or characteristics. These groups are called strata, 
and an individual group is called a stratum. In stratified sampling, you do 
the following: 


1. Divide the population into groups. 
2. Use simple random sampling on each group. 
3. Collect data from each sampling unit. 


Stratified sampling works well when a heterogeneous population is 
split into homogeneous groups. 


Cluster Sampling 


Cluster sampling is different from stratified sampling. It should be done in 
the following way: 


1. Divide the population into clusters. 


2. Use random sampling to select clusters from all possible 
clusters. 


3. Collect data from every sampling unit in the selected clusters. 


In cluster sampling, all members of the selected clusters form the 
sample, while in stratified sampling the sample is made up of members 
randomly selected from every stratum. For example, say 13 colleges serve 
as the strata or the clusters. In stratified sampling, 30 students from each 
college (stratum) are selected using random sampling. In cluster sampling, 
5 of the 13 colleges are selected using random sampling, and all students 
in those 5 colleges form the sample. 
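
The college example can be sketched in base R. The data frame students, its column names, the 100 students per college, and the seed are all assumptions made up for this illustration:

```r
set.seed(42)  # arbitrary seed, used only so the sketch is reproducible

# Hypothetical population: 13 colleges with 100 students each
students <- data.frame(
  id      = 1:1300,
  college = rep(paste0("college_", 1:13), each = 100)
)

# Stratified sampling: 30 randomly chosen students from every college
strata <- split(students, students$college)
stratified <- do.call(
  rbind,
  lapply(strata, function(g) g[sample(nrow(g), 30), ])
)

# Cluster sampling: pick 5 colleges at random and keep all their students
chosen_colleges <- sample(unique(students$college), 5)
clustered <- students[students$college %in% chosen_colleges, ]

nrow(stratified)  # 13 colleges x 30 students
nrow(clustered)   # 5 colleges x 100 students
```

The stratified sample contains every college, while the cluster sample contains only the five selected colleges.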
You can do random sampling using the sample() function: 


> set.seed(123); 
> var1 <- rnorm(100, mean=2, sd=1); 
> var2 <- rnorm(100, mean=3, sd=1); 
> var3 <- rnorm(100, mean=3, sd=2); 
> data <- data.frame(var1, var2, var3); 
> sample(data$var1, 5, replace=TRUE); 
[1] 1.27110877 2.92226747 0.97399555 1.70492852 0.03338284 


data$var1 is the data, 5 is the number of items to select, 
and replace=TRUE means that the chosen item can be repeated. If 
replace=FALSE, the chosen item cannot be repeated. 

You can do stratified sampling using the dplyr library. You must install 
the dplyr library before using it: 


> install.packages("dplyr"); 
Installing package into 'C:/Users/gohmi/Documents/R/win-library/3.5' 
(as 'lib' is unspecified) 
also installing the dependencies 'fansi', 'utf8', 'bindr', 
'cli', 'crayon', 'pillar', 'purrr', 'assertthat', 'bindrcpp', 
'glue', 'magrittr', 'pkgconfig', 'R6', 'Rcpp', 'rlang', 
'tibble', 'tidyselect', 'BH', 'plogr' 


The downloaded binary packages are in 
        C:\Users\gohmi\AppData\Local\Temp\RtmpaizSiC\downloaded_packages 


You can load the iris data using 


> data(iris); 
> summary(iris); 
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species 
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50 
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50 
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50 
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199 
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800 
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500 


There are 50 setosa observations, 50 versicolor observations, and 50 
virginica observations. 


You can load the dplyr library using 


> library(dplyr); 

Attaching package: 'dplyr' 

The following objects are masked from 'package:stats': 

    filter, lag 

The following objects are masked from 'package:base': 

    intersect, setdiff, setequal, union 
You can do stratified sampling using dplyr: 


> iris_sample <- iris %>% 
+ group_by(Species) %>% 
+ sample_n(13) 
> iris_sample; 
# A tibble: 39 x 5 
# Groups:   Species [3] 
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species 
          <dbl>       <dbl>        <dbl>       <dbl> <fct> 
 1          5           3.5          1.3         0.3 setosa 
 2          5           3.4          1.5         0.2 setosa 
 3          5.1         3.4          1.5         0.2 setosa 
 4          5.7         4.4          1.5         0.4 setosa 
 5          5.1         3.5          1.4         0.3 setosa 
 6          5.2         3.4          1.4         0.2 setosa 
 7          5           3.6          1.4         0.2 setosa 
 8          5.1         3.5          1.4         0.2 setosa 
 9          4.5         2.3          1.3         0.3 setosa 
10          5.1         3.3          1.7         0.5 setosa 
# ... with 29 more rows 
> View(iris_sample); 


sample_n(13) selects 13 items from each group. group_by(Species) 
means you group the data by the Species variable. See Figure 6-1. 


Figure 6-1. Stratified Sampling - Selected 13 samples from each group 


Correlations 


Correlations are statistical associations that measure how closely two 
variables are related and describe the linear relationship between them. In 
predictive analytics, you can use correlation to find which variables are 
more related to the target variable and use this to reduce the number of 
variables. Correlation does not mean a causal relationship. Correlation 
finds how close two variables are, but does not tell you the how and why of 
the relationship. Causation tells you that a change in one variable will 
cause another variable to change. 
The formula for correlation is 

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$ 

where $(x_i - \bar{x})^2$ is the x value minus the mean of x, squared, and 
$(y_i - \bar{y})^2$ is the y value minus the mean of y, squared. 
To get the correlation, you generate sample data first: 


> set.seed(123); 
> var1 <- rnorm(100, mean=2, sd=1); 
> var2 <- rnorm(100, mean=3, sd=1); 
> var3 <- rnorm(100, mean=3, sd=2); 
> data <- data.frame(var1, var2, var3); 


You can use the cor() function to get the correlation: 


> cor(data$var1, data$var2); 
[1] -0.04953215 
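
As a check on what cor() does, the Pearson correlation can also be computed by hand from deviations around the means. This is a minimal sketch; the names x, y, dx, dy, and r_manual are chosen here for illustration:

```r
set.seed(123)
x <- rnorm(100, mean = 2, sd = 1)
y <- rnorm(100, mean = 3, sd = 1)

# Deviations of each value from its own mean
dx <- x - mean(x)
dy <- y - mean(y)

# Sum of cross-products divided by the square root of the product
# of the two sums of squared deviations
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(r_manual)
print(cor(x, y))  # should agree with r_manual
```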


The correlation has a range from -1.0 to 1.0. When the correlation is 0, 
there is no linear relationship. When the correlation is more than 
0, it is a positive relationship. Positive correlation means that when one 
variable increases, the other variable also tends to increase. When the 
correlation is less than 0, it is a negative relationship. Negative correlation 
means that when one variable increases, the other variable tends to 
decrease. 1 is the perfect positive correlation and -1 is the perfect negative 
correlation. Hence, the closer the value is to 1 or to -1, the stronger the 
relationship. 

-0.04953215 means that the correlation is a negative relationship 
between var1 and var2. The correlation is close to zero, so the linear 
relationship is very weak. 


Covariance 


Covariance is a measure of how two variables vary together. When greater 
values of one variable tend to go with greater values of the other variable, 
the covariance is positive. When greater values of one variable tend to go 
with lesser values of the other variable, the covariance is negative. 
Covariance shows the direction of the linear relationship between both 
variables, but the covariance magnitude is difficult to interpret because it 
depends on the units of the variables. Correlation is the normalized version 
of covariance. The sample covariance is 

$$\operatorname{cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$ 

where $(x_i - \bar{x})$ is the x value minus the mean, $(y_i - \bar{y})$ is 
the y value minus the mean, and $n$ is the number of observations. 


To get a covariance, create sample data: 


> set.seed(123); 
> var1 <- rnorm(100, mean=2, sd=1); 
> var2 <- rnorm(100, mean=3, sd=1); 
> var3 <- rnorm(100, mean=3, sd=2); 
> data <- data.frame(var1, var2, var3); 


You can use the cov() function to get the covariance: 


> cov(data$var1, data$var2); 
[1] -0.04372107 
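
The statement that correlation is a normalized covariance can be checked directly. The sketch below assumes the sample (n - 1) form of covariance, which is what cov() computes; the variable names are illustrative:

```r
set.seed(123)
x <- rnorm(100, mean = 2, sd = 1)
y <- rnorm(100, mean = 3, sd = 1)

# Sample covariance: sum of cross-products of deviations over n - 1
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

# Dividing the covariance by both standard deviations gives the correlation
cor_from_cov <- cov(x, y) / (sd(x) * sd(y))

print(cov_manual)
print(cor_from_cov)
```

Because the units cancel in the division, the correlation stays between -1 and 1 regardless of how x and y are scaled.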


Correlation has a range of -1 to 1. Covariance does not have a fixed range. 
Correlation is good for measuring how strong the relationship between 
two variables is. When two variables have a positive covariance, when one 
variable increases, the other variable tends to increase. When two variables 
have a negative covariance, when one variable increases, the other variable 
tends to decrease. When two variables are independent of each other, the 
covariance is zero. -0.04372107 means the covariance is negative, and it is 
very close to zero, so the linear relationship between the two variables is 
very weak. 
Correlation and covariance are usually treated as part of descriptive 
statistics. 


Hypothesis Testing and P-Value 


I mentioned hypothesis testing previously. In hypothesis testing, a research 
question is a hypothesis asked in question format. A research question can 
be, Is there a significant difference between something? A hypothesis can be, 
There is a significant difference between something. The research question 
begins with Is there and the hypothesis begins with There is. A hypothesis 
can also be stated as a null hypothesis, $H_0$, and an alternate hypothesis, 
$H_a$. You can write the null hypothesis and alternate hypothesis as follows: 

$$H_0: \mu_1 = \mu_2$$ 

$$H_a: \mu_1 \neq \mu_2$$ 

where $\mu_1$ is the mean of one data set, and $\mu_2$ is the mean of 
another data set. 

You can use statistical tests to get your p-value. You use a t-test for 
continuous variables or data, and you use a chi-square test for categorical 
variables or data. To compare the means of more than two groups, you use 
ANOVA. If data is not normally distributed, you use non-parametric tests. 
A p-value helps you determine the significance of your statistical test 
results. The claim being tested is known as the null hypothesis, and the 
alternate hypothesis means that you believe the null hypothesis is untrue. 


• A small p-value <= alpha, which is usually 0.05, 
indicates that the observed data is sufficiently 
inconsistent with the null hypothesis, so the null 
hypothesis may be rejected. The result supports the 
alternate hypothesis at the 95% confidence level. 
• A larger p-value means that you fail to reject the null 
hypothesis. 
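
The decision rule in the two bullets can be written as a small helper. The function name decide() and its return strings are hypothetical and only illustrate the comparison against alpha:

```r
# Hypothetical helper: compare a p-value against the significance level
decide <- function(p_value, alpha = 0.05) {
  if (p_value <= alpha) {
    "reject the null hypothesis"
  } else {
    "fail to reject the null hypothesis"
  }
}

decide(0.003)  # small p-value
decide(0.52)   # large p-value
```

Failing to reject the null hypothesis is not the same as proving it true; it only means the data does not provide enough evidence against it.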


T-Test 


A t-test is one of the more important tests in statistics. A t-test is used to 
determine whether the means of two samples are equal to each other. The 
null hypothesis means that the two means are equal, and the alternative 
means that the two means are different. 


Types of T-Tests 


Figure 6-2 shows the types of t-tests: the independent (uncorrelated) 
t-test, which has an equal-variance form and an unequal-variance form, and 
the paired (correlated) t-test. 


Figure 6-2. Types of T-Tests 


Assumptions of T-Tests 


Here are the assumptions: 


• The samples are randomly sampled from their 
population. 


• The population is normally distributed. 


Type I and Type II Errors 


A type I error is a rejection of the null hypothesis when it is really true. 
A type II error is a failure to reject a null hypothesis that is false. 
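
A type I error rate can be illustrated by simulation. In the sketch below every simulated sample truly has mean 0, so each rejection is a type I error; the seed, sample size of 30, and 2000 repetitions are arbitrary choices for this illustration:

```r
set.seed(1)  # arbitrary seed for reproducibility
alpha <- 0.05
n_sims <- 2000

# The null hypothesis (mean equals 0) is true for every simulated sample
p_values <- replicate(
  n_sims,
  t.test(rnorm(30, mean = 0, sd = 1), mu = 0)$p.value
)

# Share of simulations in which the true null hypothesis is rejected
type1_rate <- mean(p_values <= alpha)
print(type1_rate)
```

The observed rate should be close to alpha, since alpha is the probability of a type I error that you accept when choosing the significance level.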


One-Sample T-Test 


A one-sample t-test is used to test whether the mean of a population is 
equal to a specified mean. 
The formula of a one-sample t-test is 

$$t = \frac{m - \mu}{s / \sqrt{n}}$$ 

where s is the standard deviation of the sample, n is the size of the sample, 
m is the mean of the sample, and $\mu$ is the specified mean. 
The degree of freedom formula is 

$$df = n - 1$$ 

You can use the t statistic and the degrees of freedom to estimate the 
p-value using a t-table. 
To use a one-sample t-test in R, you can use the t.test() function: 


> set.seed(123); 
> var1 <- rnorm(100, mean=2, sd=1); 
> var2 <- rnorm(100, mean=3, sd=1); 
> var3 <- rnorm(100, mean=3, sd=2); 
> data <- data.frame(var1, var2, var3); 
> t.test(data$var1, mu=0.6); 


        One Sample t-test 


data:  data$var1 
t = 16.328, df = 99, p-value < 2.2e-16 
alternative hypothesis: true mean is not equal to 0.6 
95 percent confidence interval: 
 1.909283 2.271528 
sample estimates: 
mean of x 
 2.090406 


In a one-sample t-test, 

$$H_0: \mu = m$$ 

$$H_a: \mu \neq m$$ 

Here $\mu$ is the population mean and m is the hypothesized mean, which is 
0.6 in the above R code. The p-value is less than 2.2e-16, so it is less 
than 0.05, which is the alpha value. Therefore, the null hypothesis may be 
rejected. The data supports the alternate hypothesis, $\mu \neq 0.6$, at 
the 95% confidence level. 
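
To connect the one-sample formula to the t.test() output, the statistic and p-value can be computed by hand. This is a sketch; the names x, mu0, t_manual, and p_manual are chosen for illustration:

```r
set.seed(123)
x <- rnorm(100, mean = 2, sd = 1)
mu0 <- 0.6  # hypothesized mean

# t = (sample mean - hypothesized mean) / (sample sd / sqrt(n))
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
df <- length(x) - 1

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df)

res <- t.test(x, mu = mu0)
print(c(manual = t_manual, t_test = unname(res$statistic)))
```

Using pt() in place of a printed t-table gives the same kind of lookup with more precision.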


Two-Sample Independent T-Test 


The two-sample unpaired t-test is used when you compare the means of two 
independent samples. The formula of the two-sample independent t-test is 

$$t = \frac{\mu_A - \mu_B}{\sqrt{\frac{s^2}{n_A} + \frac{s^2}{n_B}}}$$ 

where 

$\mu_A$ is the mean of one sample, 

$\mu_B$ is the mean of the second sample, 

$n_A$ is the size of sample A, and 

$n_B$ is the size of sample B. 

$s^2$ is the estimator of the common variance of the two samples, and the 
formula is 

$$s^2 = \frac{\sum_{i=1}^{n_A} (x_{A,i} - \mu_A)^2 + \sum_{j=1}^{n_B} (x_{B,j} - \mu_B)^2}{n_A + n_B - 2}$$ 

where $x_{A,i}$ and $x_{B,j}$ are the individual values of samples A and B. 
The degrees of freedom formula is 

$$df = n_A + n_B - 2$$ 

To use a two-sample unpaired t-test with equal variances in R: 


> set.seed(123); 
> var1 <- rnorm(100, mean=2, sd=1); 
> var2 <- rnorm(100, mean=3, sd=1); 
> var3 <- rnorm(100, mean=3, sd=2); 
> data <- data.frame(var1, var2, var3); 
> t.test(data$var1, data$var2, var.equal=TRUE, paired=FALSE); 


        Two Sample t-test 


data:  data$var1 and data$var2 
t = -6.0315, df = 198, p-value = 7.843e-09 
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval: 
 -1.0642808 -0.5398138 
sample estimates: 
mean of x mean of y 
 2.090406  2.892453 


The two-sample independent t-test: 

$$H_0: \mu_A - \mu_B = 0$$ 

$$H_a: \mu_A - \mu_B \neq 0$$ 

The p-value is 7.843e-09, so it is less than 0.05, which is the alpha value. 
Therefore, the null hypothesis may be rejected. The data supports the 
alternate hypothesis, $\mu_A - \mu_B \neq 0$, at the 95% confidence level. 
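
The pooled-variance calculation behind the equal-variance t-test can be reproduced step by step. This is a sketch with illustrative names (a, b, sp2, t_pooled), assuming the usual pooled estimator:

```r
set.seed(123)
a <- rnorm(100, mean = 2, sd = 1)
b <- rnorm(100, mean = 3, sd = 1)
n_a <- length(a)
n_b <- length(b)

# Pooled estimate of the common variance
sp2 <- ((n_a - 1) * var(a) + (n_b - 1) * var(b)) / (n_a + n_b - 2)

# t statistic and degrees of freedom
t_pooled <- (mean(a) - mean(b)) / sqrt(sp2 / n_a + sp2 / n_b)
df_pooled <- n_a + n_b - 2

res_pooled <- t.test(a, b, var.equal = TRUE)
print(c(manual = t_pooled, t_test = unname(res_pooled$statistic)))
```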

In the two-sample unpaired t-test, when the variances are unequal, you 
use the Welch t-test. You can assume the variances of the two data sets are 
different, or you can calculate the variance of each data set. The Welch 
t-test formula is as follows: 

$$t = \frac{\mu_A - \mu_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}$$ 

where 

$\mu_A$ is the mean of sample A, 
$\mu_B$ is the mean of sample B, 
$n_A$ is the sample size of A, 
$n_B$ is the sample size of B, 
$s_A$ is the standard deviation of A, and 
$s_B$ is the standard deviation of B. 

Unlike the normal t-test formula, the Welch t-test formula uses the 
variance of each sample separately. 

The degrees of freedom formula of the Welch t-test is as follows: 

$$df = \frac{\left(\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}\right)^2}{\frac{(s_A^2 / n_A)^2}{n_A - 1} + \frac{(s_B^2 / n_B)^2}{n_B - 1}}$$ 

To use the two-sample unpaired t-test with unequal variances in R: 


> set.seed(123); 
> var1 <- rnorm(100, mean=2, sd=1); 
> var2 <- rnorm(100, mean=3, sd=1); 
> var3 <- rnorm(100, mean=3, sd=2); 
> data <- data.frame(var1, var2, var3); 
> t.test(data$var1, data$var2, var.equal=FALSE, paired=FALSE); 


        Welch Two Sample t-test 


data:  data$var1 and data$var2 
t = -6.0315, df = 197.35, p-value = 7.88e-09 
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval: 
 -1.0642862 -0.5398084 
sample estimates: 
mean of x mean of y 
 2.090406  2.892453 


The two-sample unpaired t-test: 

$$H_0: \mu_A - \mu_B = 0$$ 

$$H_a: \mu_A - \mu_B \neq 0$$ 

The p-value is 7.88e-09, so it is less than 0.05, which is the alpha value. 
Therefore, the null hypothesis may be rejected. The data supports the 
alternate hypothesis, $\mu_A - \mu_B \neq 0$, at the 95% confidence level. 
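
The fractional degrees of freedom reported by the Welch test come from the Welch-Satterthwaite approximation. The sketch below recomputes them by hand; the names va, vb, t_welch, and df_welch are illustrative:

```r
set.seed(123)
a <- rnorm(100, mean = 2, sd = 1)
b <- rnorm(100, mean = 3, sd = 1)

# Variance of each sample mean (sample variance divided by sample size)
va <- var(a) / length(a)
vb <- var(b) / length(b)

t_welch <- (mean(a) - mean(b)) / sqrt(va + vb)

# Welch-Satterthwaite degrees of freedom
df_welch <- (va + vb)^2 /
  (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

res_welch <- t.test(a, b, var.equal = FALSE)
print(c(manual_df = df_welch, t_test_df = unname(res_welch$parameter)))
```

When the two samples have the same size, the Welch and pooled t statistics coincide and only the degrees of freedom differ, which is why the two outputs above show the same t value.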


Two-Sample Dependent T-Test 


A two-sample paired t-test is used to test the means of two samples that 
depend on each other. The t-test formula is 

$$t = \frac{\bar{d}}{\sqrt{s^2 / n}}$$ 

where 
$\bar{d}$ is the mean difference, 
$s^2$ is the sample variance of the differences, and 
n is the sample size (the number of pairs). 
The degree of freedom formula is 

$$df = n - 1$$ 


To use the two-sample paired t-test in R: 


> set.seed(123); 
> var1 <- rnorm(100, mean=2, sd=1); 
> var2 <- rnorm(100, mean=3, sd=1); 
> var3 <- rnorm(100, mean=3, sd=2); 
> data <- data.frame(var1, var2, var3); 
> t.test(data$var1, data$var2, paired=TRUE); 


        Paired t-test 


data:  data$var1 and data$var2 
t = -5.8876, df = 99, p-value = 5.379e-08 
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval: 
 -1.0723482 -0.5317464 
sample estimates: 
mean of the differences 
             -0.8020473 


The two-sample paired t-test: 

$$H_0: \mu_A - \mu_B = 0$$ 

$$H_a: \mu_A - \mu_B \neq 0$$ 

The p-value is 5.379e-08, so it is less than 0.05, which is the alpha value. 
Therefore, the null hypothesis may be rejected. The data supports the 
alternate hypothesis, $\mu_A - \mu_B \neq 0$, at the 95% confidence level. 
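
A paired t-test is equivalent to a one-sample t-test on the pairwise differences. The following sketch shows this with illustrative names (before, after, d):

```r
set.seed(123)
before <- rnorm(100, mean = 2, sd = 1)
after  <- rnorm(100, mean = 3, sd = 1)

# Pairwise differences between the two measurements
d <- before - after

# t = mean difference / sqrt(variance of differences / number of pairs)
t_manual <- mean(d) / sqrt(var(d) / length(d))

paired_res <- t.test(before, after, paired = TRUE)
diff_res   <- t.test(d, mu = 0)

print(c(manual = t_manual,
        paired = unname(paired_res$statistic),
        one_sample = unname(diff_res$statistic)))
```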


Chi-Square Test 


The chi-square test is used to test whether there is a relationship between 
two categorical variables. The null hypothesis means that there is no 
relationship between the categorical variables. 


Goodness of Fit Test 


When you have only one categorical variable from a population and you 
want to compare whether the sample is consistent with a hypothesized 
distribution, you can use the goodness of fit test. 
The null hypothesis means that the data follows a specified 
distribution, and the alternate hypothesis means that the data doesn't 
follow the specified distribution. 

The goodness of fit formula is 

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$ 

where $O_i$ is the observed frequency of bin i and $E_i$ is the expected 
frequency of bin i. 


To calculate the expected frequency, the formula is 

$$E_i = N \cdot p_i$$ 

where N is the total sample size and $p_i$ is the hypothesized proportion of 
the observations of bin i. 

To use the goodness of fit chi-square test in R, you can use the 
chisq.test() function: 


> data <- c(B=200, C=300, D=400); 
> chisq.test(data); 


        Chi-squared test for given probabilities 


data:  data 
X-squared = 66.667, df = 2, p-value = 3.338e-15 


The goodness of fit chi-square test: 


• $H_0$: No significant difference between the observed and 
expected values. 


• $H_a$: There is a significant difference between the 
observed and expected values. 

The p-value is 3.338e-15, so it is less than 0.05, which is the alpha value. 
Therefore, the null hypothesis may be rejected. The data supports the 
alternate hypothesis of a significant difference between the observed and 
expected values at the 95% confidence level. 
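
The statistic reported by chisq.test() can be reproduced from the observed and expected frequencies. This sketch assumes equal hypothesized proportions for the three bins, which is the default that chisq.test() uses when no probabilities are given:

```r
observed <- c(B = 200, C = 300, D = 400)

# Equal hypothesized proportions for each bin
p <- rep(1 / length(observed), length(observed))
expected <- sum(observed) * p  # E_i = N * p_i, here 300 per bin

# Sum over bins of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

print(chi_sq)
print(p_value)
```

To test against unequal proportions, chisq.test() accepts them through its p argument.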


Contingency Test 


If you have two categorical variables and you want to test whether there 
is a relationship between the two variables, you can use the contingency 
test. 
The null hypothesis means that the two categorical variables have 
no relationship. The alternate hypothesis means that the two categorical 
variables have a relationship. 
To calculate the expected value, use 

$$E_{ij} = \frac{R_i C_j}{N}$$ 

where $R_i$ is the total of row i, $C_j$ is the total of column j, and N is 
the grand total. 
The formula for the $\chi^2$ statistic is 

$$\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$ 

where $O_{ij}$ is the observed count in row i and column j. 


To use a contingency test in R, create your data first: 


> var1 <- c("Male", "Female", "Male", "Female", "Male", 
"Female", "Male", "Female", "Male", "Female"); 
> var2 <- c("chocolate", "strawberry", "strawberry", 
"strawberry", "chocolate", "chocolate", "chocolate", 
"strawberry", "strawberry", "strawberry"); 
> data <- data.frame(var1, var2); 
> data; 
     var1       var2 
1    Male  chocolate 
2  Female strawberry 
3    Male strawberry 
4  Female strawberry 
5    Male  chocolate 
6  Female  chocolate 
7    Male  chocolate 
8  Female strawberry 
9    Male strawberry 
10 Female strawberry 


You can then create a table or frequency table for the variables to test: 


> data.table <- table(data$var1, data$var2); 
> data.table; 

         chocolate strawberry 
  Female         1          4 
  Male           3          2 


You can then use the chisq.test() function for the contingency test: 


> chisq.test(data.table); 

        Pearson's Chi-squared test with Yates' continuity correction 

data:  data.table 
X-squared = 0.41667, df = 1, p-value = 0.5186 

Warning message: 
In chisq.test(data.table) : Chi-squared approximation may be incorrect 
The chi-square test: 
• $H_0$: The two variables are independent. 
• $H_a$: The two variables are not independent. 


The p-value is 0.5186, so it is more than 0.05, which is the alpha value. 
Therefore, you fail to reject the null hypothesis. At the 95% confidence 
level, there is no evidence that the two variables are dependent. 
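
The expected counts for the contingency table can be computed from the row and column totals. The sketch below rebuilds the 2 x 2 table by hand (the object names tab and expected are illustrative) and compares the result with what chisq.test() stores:

```r
# Observed counts, filled column by column
tab <- matrix(
  c(1, 3, 4, 2),
  nrow = 2,
  dimnames = list(c("Female", "Male"), c("chocolate", "strawberry"))
)

# E_ij = (row total i * column total j) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
print(expected)

# chisq.test() keeps the same expected counts in its result
res <- suppressWarnings(chisq.test(tab))
print(res$expected)
```

Every expected count here is below 5, which is the situation in which R warns that the chi-squared approximation may be incorrect.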


ANOVA 


ANOVA is the process of testing the means of two or more groups. ANOVA 
also checks the impact of factors by comparing the means of different 
samples. In a t-test, you test the means of two samples; in a chi-square 
test, you test categorical attributes or variables; in ANOVA, you test the 
means of more than two samples. 


Grand Mean 


In ANOVA, you use two kinds of means, sample means and a grand mean. 
A grand mean is the mean of all observations pooled across the samples; 
when every sample has the same size, it equals the mean of the samples' 
means. 


Hypothesis 


In ANOVA, a null hypothesis means that the sample means are equal or 
do not have significant differences. The alternate hypothesis is when the 
sample means are not all equal. 

$$H_0: \mu_1 = \mu_2 = \dots = \mu_k \quad \text{Null hypothesis}$$ 

$$H_a: \mu_l \neq \mu_m \text{ for at least one pair } (l, m) \quad \text{Alternate hypothesis}$$ 


Assumptions 


You assume that the samples are randomly drawn, independent of each 
other, and taken from populations that are normally distributed with 
unknown but equal variances. 


Between Group Variability 


When the distributions of two samples overlap, their means are not 
significantly different. Hence, the difference between each individual 
mean and the grand mean is not significant. The groups, or levels, are the 
different categories of the same independent variable. See Figure 6-3. 





Figure 6-3. Means are not Significantly Different 


For the two samples shown in Figure 6-4, their means are significantly 
different from each other. The difference between the individual means 
and the grand mean is significantly different. 


Figure 6-4. Means are Significantly Different 
This variability is called the between-group variability, which refers to 
the variations between the distributions of the groups or levels. Figure 6-5 
depicts the discrimination between the different groups. 


Figure 6-5. Variations between the distributions of groups or levels 
(panels: Some Discrimination, Little Discrimination, Large Discrimination) 


To calculate the sum of squares of the between-group variability, use 

$$SS_{between} = n_1(\bar{x}_1 - \bar{x}_G)^2 + n_2(\bar{x}_2 - \bar{x}_G)^2 + n_3(\bar{x}_3 - \bar{x}_G)^2 + \dots + n_k(\bar{x}_k - \bar{x}_G)^2$$ 

where 

$\bar{x}_G$ is the grand mean, 

$\bar{x}_1 \dots \bar{x}_k$ are the means of each sample, and 

$n_1 \dots n_k$ are the sample sizes. 

To calculate the mean square, which is the sum of squares averaged over the 
degrees of freedom, use 

$$MS_{between} = \frac{SS_{between}}{k - 1}$$ 

You divide the SS by the degree of freedom, where the degree of 
freedom is the number of sample means (k) minus one. 


Within Group Variability 


As the variances of the sample distributions increase, the samples overlap 
each other more and look like parts of a single population, as shown in 
Figure 6-6. 





Figure 6-6. Distributions of Samples 


Figure 6-7 shows another three samples with smaller variances; although 
the means are similar, they belong to different populations. 





Figure 6-7. Three samples with lesser variances 


Within-group variation refers to the variations caused by differences 
within individual groups or levels. To calculate the sum of squares of 
within-group variation, use 

$$SS_{within} = \sum_i (x_{i1} - \bar{x}_1)^2 + \sum_i (x_{i2} - \bar{x}_2)^2 + \dots + \sum_i (x_{ik} - \bar{x}_k)^2 = \sum_j \sum_i (x_{ij} - \bar{x}_j)^2$$ 

where 

$x_{i1}$ is the ith value of the first sample, 

$x_{i2}$ is the ith value of the second sample, and 
$x_{ij}$ is the ith value from the jth sample. 
The degree of freedom is 

$$df_{within} = (n_1 - 1) + (n_2 - 1) + \dots + (n_k - 1) = n_1 + n_2 + \dots + n_k - k = N - k$$ 

To get the mean square of the within-group variability, you divide the 
within-group sum of squares by the within-group degrees of freedom: 

$$MS_{within} = \frac{\sum_j \sum_i (x_{ij} - \bar{x}_j)^2}{N - k}$$ 

The F-statistic measures whether the means of the samples are 
significantly different. The lower the F-statistic, the closer the means are 
to being equal, so you cannot reject the null hypothesis. 

$$F = \frac{\text{Between-group variability}}{\text{Within-group variability}} = \frac{MS_{between}}{MS_{within}}$$ 

If the F-critical value is smaller than the F-statistic, reject the null 
hypothesis. The F-critical value can be found from the F distribution using 
the significance level and the two degrees of freedom (between and within). 
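
The between-group and within-group calculations can be traced step by step and compared with aov(). This is a sketch under the assumption of a single grouping factor; the names y, g, ss_between, ss_within, and f_stat are illustrative, and the grand mean is taken as the mean of all observations:

```r
set.seed(123)
y <- rnorm(12, mean = 2, sd = 1)
g <- factor(c("B", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D", "B"))

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)    # one mean per group
group_sizes <- tapply(y, g, length)  # one size per group

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((y - group_means[as.character(g)])^2)

k <- nlevels(g)  # number of groups
N <- length(y)   # total number of observations

# F statistic as the ratio of the two mean squares
f_stat  <- (ss_between / (k - 1)) / (ss_within / (N - k))
p_value <- pf(f_stat, k - 1, N - k, lower.tail = FALSE)

# F-critical value at the 5% significance level
f_crit <- qf(0.95, k - 1, N - k)

anova_table <- summary(aov(y ~ g))[[1]]
print(c(manual_F = f_stat, aov_F = anova_table[["F value"]][1]))
```

Comparing f_stat with f_crit leads to the same decision as comparing p_value with 0.05.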


One-Way ANOVA 


One-way ANOVA is used when you have only one independent variable. 
In R, you can calculate the one-way ANOVA using 


> set.seed(123); 
> var1 <- rnorm(12, mean=2, sd=1); 
> var2 <- c("B", "B", "B", "B", "C", "C", "C", "C", "C", "D", 
"D", "B"); 
> data <- data.frame(var1, var2); 
> fit <- aov(data$var1 ~ data$var2, data=data); 
> fit; 
Call: 
   aov(formula = data$var1 ~ data$var2, data = data) 

Terms: 
                data$var2 Residuals 
Sum of Squares   0.162695  9.255706 
Deg. of Freedom         2         9 

Residual standard error: 1.014106 
Estimated effects may be unbalanced 


To get the p-value, you use the summary() function: 


> summary(fit); 
            Df Sum Sq Mean Sq F value Pr(>F) 
data$var2    2  0.163  0.0813   0.079  0.925 
Residuals    9  9.256  1.0284 


$$H_0: \mu_B = \mu_C = \mu_D \quad \text{Null hypothesis}$$ 

$$H_a: \text{at least one group mean differs} \quad \text{Alternate hypothesis}$$ 


The p-value is more than 0.05, so you fail to reject the null hypothesis 
that the mean of var1 is the same across the groups of var2. There is no 
evidence of a difference between the group means at the 95% confidence 
level. 


Two-Way ANOVA 


Two-way ANOVA is used when you have two independent variables. 
In R, you can calculate two-way ANOVA using 


> varl <- rnorm(12, mean=2, sd-1); 
> var? «- cC"B", "B", "B", "B", k cu "S s “Cs "e gi rie "D", "D", "B"5; 
k usd au EEUU. CS o que spe MEN. NEM, A, SEN pM, NA 
> data <- data.frame(varl, var2, var3); 
> fit <- aov(data$varl ~ data$var? + data$var3, data=data); 
> Tit 
Call: 
aov(formula = data$varl ~ data$var2 + data$var3, data = data) 
Terms: 


data$var2 data$var3 Residuals 
Sum of Squares 1.742350 0.780605 8.735765 
Deg. of Freedom 2 1 8 


Residual standard error: 1.044974 
l out of 5 effects not estimable 
Estimated effects may be unbalanced 
> summary(fit); 

Df Sum Sq Mean Sq F value Pr(>F) 
data$var2 2 1.742 0.8712 0.798 0.483 
data$var3 1 0.781 0.7806 0.715 0.422 
Residuals 8 8.736 1.0920 


> 


> set.seed(123); 

> vari <- rnorm(12, mean=2, sd-1); 

bar «o eph. Cp. wp QE wor dnm Ges des de a 
"D", "B"); 

> Vara <= Co Dy "Du DECUS Es E's CPS Es bs Fs 
"E", "E"; 
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> data <- data.frame(var1, var2, var3); 
> fit <- aov(data$var1 ^ data$var2 + data$var3, data-data); 
» fit; 
Call: 
aov(formula = data$vari ~ data$var2 + data$var3, data = data) 


Terms: 

data$var2 data$var3 Residuals 
Sum of Squares 0.162695 0.018042 9.237664 
Deg. of Freedom 2 1 8 


Residual standard error: 1.074573 
1 out of 5 effects not estimable 
Estimated effects may be unbalanced 
> summary(fit) ; 

Df Sum Sq Mean Sq F value Pr(>F) 
data$var2 2 0.163 0.0813 0.070 0.933 
data$var3 1 0.018 0.0180 0.016 0.904 
Residuals 8 9.238 1.1547 


> set.seed(123); 

> varl «- rnorm(12, mean=2, sd-1); 

> WP we cC"B", "B", "uo. "B", "c", "c", "c", "c", "c", "D", ww 
> var3 <- e(CD', "D", "D", "D", E; "E", "E", "E", E, Fos "F", "F; 
> data <- data.frame(varl, var2, var3); 

> fit <- aov(data$varl ~ data$var? + data$var3, data=data); 

> fit; 

Call: 


aov(formula = data$varl ~ data$var? + data$var3, data = data) 


Terms: 

data$var2 data$var3 Residuals 
Sum of Squares 0.162695 0.018042 9.23/664 
Deg. of Freedom 2 1 8 


Residual standard error: 1.074573 
l out of 5 effects not estimable 
Estimated effects may be unbalanced 
> summary(fit); 

Df Sum Sq Mean Sq F value Pr(>F) 
data$var2 2 0.163 0.0813 0.070 0.933 
data$var3 1 0.018 0.0180 0.016 0.904 
Residuals 8 9.238 1.1547 


> 
$$H_0: \mu_1 = \mu_2 = \mu_3 \quad \text{(null hypothesis)}$$

$$H_a: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j \quad \text{(alternate hypothesis)}$$

Here $\mu_j$ is the mean of var1 at the jth level of the factor being tested.

var1 does not depend on var2 or var3. In the first output above, var2 has a p-value of 0.483, which is more than 0.05. Hence, you fail to reject the null hypothesis that the var1 mean is the same across the levels of var2 at the 5% significance level (95% confidence level). var3 has a p-value of 0.422, which is also more than 0.05. Hence, you fail to reject the null hypothesis that the var1 mean is the same across the levels of var3. The seeded run leads to the same conclusion, with p-values of 0.933 and 0.904.
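The p-values can also be read from the fitted object instead of the printed table. The following is a minimal sketch, assuming the built-in warpbreaks data set as a stand-in (it is not used elsewhere in this chapter) and bare column names in the formula:

```r
# Two-way ANOVA sketch on the built-in warpbreaks data (an assumption;
# any data frame with one numeric response and two factors would do).
data(warpbreaks)
fit <- aov(breaks ~ wool + tension, data = warpbreaks)

# summary() returns a list; its first element is the ANOVA table.
tab <- summary(fit)[[1]]
p_values <- tab[["Pr(>F)"]]
names(p_values) <- trimws(rownames(tab))  # row names are padded with spaces

# TRUE means the factor is significant at the 5% level.
significant <- p_values[c("wool", "tension")] < 0.05
print(p_values)
print(significant)
```

Using bare column names with data = keeps the row labels short (wool rather than warpbreaks$wool) and gives the same fit.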


MANOVA 


Multivariate analysis of variance (MANOVA) is used when there are multiple response variables that you want to test at the same time.
To use MANOVA in R, you can load the iris data:


> data(iris);
> str(iris);
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> summary(iris);
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
> res <- manova(cbind(iris$Sepal.Length, iris$Petal.Length) ~ iris$Species, data=iris);
> summary(res);
              Df Pillai approx F num Df den Df    Pr(>F)
iris$Species   2 0.9885   71.829      4    294 < 2.2e-16 ***
Residuals    147
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary.aov(res);
 Response 1 :
              Df Sum Sq Mean Sq F value    Pr(>F)
iris$Species   2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals    147 38.956   0.265
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response 2 :
              Df Sum Sq Mean Sq F value    Pr(>F)
iris$Species   2 437.10 218.551  1180.2 < 2.2e-16 ***
Residuals    147  27.22   0.185
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


cbind(iris$Sepal.Length, iris$Petal.Length) ~ iris$Species is the formula, like cbind(Sepal.Length, Petal.Length) ~ Species. Hence, you have two response variables, Sepal.Length and Petal.Length.

$$H_0: \mu_{setosa} = \mu_{versicolor} = \mu_{virginica} \quad \text{(null hypothesis)}$$

$$H_a: \mu_i \neq \mu_j \text{ for at least one pair of species} \quad \text{(alternate hypothesis)}$$

Here $\mu$ stands for the means of the response variables Sepal.Length and Petal.Length within each species.

The p-value is less than 2.2e-16, which is less than 0.05. Hence, you reject the null hypothesis at the 5% significance level (95% confidence level). There are significant differences in the means. For the response variable Sepal.Length (Response 1), the p-value is less than 2.2e-16, which is less than 0.05. Hence, you reject the null hypothesis that the Sepal.Length mean is the same for every Species. For the response variable Petal.Length (Response 2), the p-value is also less than 2.2e-16, which is less than 0.05. Hence, you reject the null hypothesis that the Petal.Length mean is the same for every Species.
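The same model can be written with bare column names, and summary() accepts other multivariate test statistics besides the default Pillai trace. The following is a minimal sketch; the choice of Wilks' lambda as the second statistic is only an illustration:

```r
# MANOVA sketch with bare column names; data = iris supplies the columns.
data(iris)
res <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)

# summary() on a manova fit returns a list whose stats element is a matrix.
pillai <- summary(res, test = "Pillai")$stats
wilks  <- summary(res, test = "Wilks")$stats

p_pillai <- pillai["Species", "Pr(>F)"]
p_wilks  <- wilks["Species", "Pr(>F)"]
print(pillai)
print(wilks)
```

Both statistics test the same null hypothesis, so with data this clearly separated they should lead to the same decision.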


Nonparametric Test 


A nonparametric test is a test that does not require the variable and sample to be normally distributed. Most of the time you should use parametric tests like the t-test, chi-square test, and ANOVA because they are more powerful when their assumptions hold. You use nonparametric tests when you do not have normally distributed data and the sample data is big.


Wilcoxon Signed Rank Test 


The Wilcoxon signed rank test is used to replace the one-sample t-test.

a. For each $x_i$, for $i = 1, 2, \dots, n$, the signed difference is $d_i = x_i - \mu_0$, where $\mu_0$ is the given median.

b. Ignore $d_i = 0$ and rank the rest of $|d_i|$, using $r_i$ as rank. When there are tied values, assign the average of the tied ranks. For example, if the $|d_i|$ ranked as 3, 4, 5 are ties, the rank should be $\frac{3 + 4 + 5}{3} = 4$.

c. The number of non-zero $d_i$s is found.

d. To each rank of $d_i$, let $s_i = \operatorname{sign}(d_i)\, r_i$.

e. The sum of the positive signed ranks is calculated using

$$W = \sum_{s_i > 0} s_i$$

The test statistic calculated is W, and the number $n_1$ of non-zero $d_i$s is calculated.

The null hypothesis is that the population median has the specified value of $\mu_0$.

• Null hypothesis: $H_0: \mu = \mu_0$
• Alternate hypothesis: $H_a: \mu \neq \mu_0$


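The normal test statistic, in its standard form without a tie correction, is

$$z = \frac{W - \frac{n_1(n_1 + 1)}{4}}{\sqrt{\frac{n_1(n_1 + 1)(2n_1 + 1)}{24}}}$$

You reject the null hypothesis when

$$|z| > Z_{\alpha/2}, \text{ where } Z \sim N(0, 1)$$

The common alpha, $\alpha$, value is 0.05.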
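Steps a to e can be traced by hand in R and compared with the statistic that wilcox.test() reports as V. The following is a minimal sketch; the sample x and the hypothesized median mu0 are made-up values, and z uses the usual normal approximation without a tie correction:

```r
# Hand computation of the Wilcoxon signed rank statistic.
# x and mu0 are hypothetical values chosen to avoid ties and zeros.
x   <- c(12, 7, 15, 11, 4, 18, 14, 19)
mu0 <- 10

d  <- x - mu0              # step a: signed differences
d  <- d[d != 0]            # step b: ignore zero differences
r  <- rank(abs(d))         # step b: rank |d|, ties get average ranks
n1 <- length(d)            # step c: number of non-zero differences
s  <- sign(d) * r          # step d: signed ranks
W  <- sum(s[s > 0])        # step e: sum of the positive signed ranks

# Normal approximation (no tie correction, no continuity correction).
z <- (W - n1 * (n1 + 1) / 4) / sqrt(n1 * (n1 + 1) * (2 * n1 + 1) / 24)

# R reports the same quantity as V.
V <- unname(wilcox.test(x, mu = mu0)$statistic)
print(c(W = W, V = V, z = z))
```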

To use the Wilcoxon signed rank test in R, you can first generate the data set using the random package, which draws numbers from random.org, so that the variables are not normally distributed. To use random.org for random number generation, you must install the random package:

> install.packages("random");
Installing package into 'C:/Users/gohmi/Documents/R/win-library/3.5'
(as 'lib' is unspecified)
also installing the dependency 'curl'

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/curl_3.2.zip'
Content type 'application/zip' length 2986409 bytes (2.8 MB)
downloaded 2.8 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/random_0.2.6.zip'
Content type 'application/zip' length 466978 bytes (456 KB)
downloaded 456 KB

package 'curl' successfully unpacked and MD5 sums checked
package 'random' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
        C:\Users\gohmi\AppData\Local\Temp\RtmpaizS1C\downloaded_packages


To use the random package, you must load the package using the library() function:

> library(random);

To create random numbers from random.org, you can use

> var1 <- randomNumbers(n=100, min=1, max=1000, col=1);
> var2 <- randomNumbers(n=100, min=1, max=1000, col=1);
> var3 <- randomNumbers(n=100, min=1, max=1000, col=1);
n is the number of random numbers, min is the minimum value, max is the maximum value, and col is the number of columns for all the numbers. This is one method to generate true random numbers in R; it needs an Internet connection because the numbers come from random.org. Your data may be different because the data is generated randomly.

You can then create the data using 


> data <- data.frame(var1[,1], var2[,1], var3[,1]);
> data;
   var1...1. var2...1. var3...1.
1 680 9 871 
2 547 589 768 
3 750 733 611 
4 840 494 16 
5 529 373 680 
6 94 509 493 
7 106 89 195 
8 956 992 570 
9 853 330 425 
10 295 485 504 
11 633 924 523 


To use the Wilcoxon signed rank test, you can use the wilcox.test() function:

> wilcox.test(data[,1], mu=0, alternative="two.sided");

        Wilcoxon signed rank test with continuity correction

data:  data[, 1]
V = 5050, p-value < 2.2e-16
alternative hypothesis: true location is not equal to 0

$$H_0: \mu = \mu_0$$

$$H_a: \mu \neq \mu_0$$

The p-value is less than 2.2e-16, which is less than 0.05. Hence, you reject the null hypothesis at the 5% significance level (95% confidence level). The median of the first variable differs significantly from the hypothesized median of 0.
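When random.org is not reachable, a similar data set can be simulated locally. The following is a minimal sketch that swaps in the pseudo-random sample() function for randomNumbers(); the seed value is arbitrary. It also shows where V = 5050 in the output comes from: every value is greater than 0, so all 100 ranks are positive and they sum to 100 * 101 / 2.

```r
# Offline stand-in for randomNumbers(): pseudo-random integers from 1 to 1000.
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
x <- sample(1:1000, size = 100, replace = TRUE)

# Shapiro-Wilk test as a quick check that the data is not normal.
normality <- shapiro.test(x)

# One-sample Wilcoxon signed rank test against a median of 0.
res <- wilcox.test(x, mu = 0, alternative = "two.sided")

n <- length(x)
print(normality$p.value)
print(res$statistic)       # V
print(n * (n + 1) / 2)     # sum of all ranks, reached when every d is positive
```

Testing against mu = 0 is not very informative for strictly positive data; a value inside the data range, such as mu = 500, gives a more meaningful test.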


Wilcoxon-Mann-Whitney Test 


The Wilcoxon-Mann-Whitney test is a nonparametric test to compare two samples. It is a powerful substitute for the two-sample t-test.

For two independent samples with distributions F(x) and G(y), where their sample sizes are $n_1$ and $n_2$, and the sample data are $x_1, x_2, \dots, x_{n_1}$ and $y_1, y_2, \dots, y_{n_2}$, the hypothesis is

$$H_0: F(x) = G(y)$$

$$H_a: F(x) \neq G(y)$$

To test the two samples,

1. Combine $x_i$ and $y_j$ as a group.

2. Rank the group in ascending order, where ties are given the average of their ranks. Let $r_{1i}$ be the rank assigned for $x_i$ for $i = 1, 2, \dots, n_1$ and $r_{2j}$ be the rank assigned for $y_j$ for $j = 1, 2, \dots, n_2$.

3. Calculate the sum of ranks:

$$S_1 = \sum_{i=1}^{n_1} r_{1i}, \qquad S_2 = \sum_{j=1}^{n_2} r_{2j}$$

The test statistic U is calculated as follows:

$$U = S_1 - \frac{n_1(n_1 + 1)}{2}$$

The approximate normal statistic z is

$$z = \frac{U - M(U) \pm \frac{1}{2}}{\sqrt{Var(U)}}$$

where

$$M(U) = \frac{n_1 n_2}{2}$$

and the variance of U is

$$Var(U) = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} - \frac{n_1 n_2}{(n_1 + n_2)(n_1 + n_2 - 1)} \times TS$$

where

$$TS = \sum_{j=1}^{\tau} \frac{(t_j - 1)\, t_j\, (t_j + 1)}{12}$$

$\tau$ is the number of ties in the sample and $t_j$ is the number of ties in the jth group.

If there are no ties, the variance of U is

$$Var(U) = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}$$
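The rank sum and U can be computed by hand and compared with the statistic that wilcox.test() reports as W. The following is a minimal sketch; the two samples are made-up values without ties, and z is computed without the continuity correction:

```r
# Hand computation of the Mann-Whitney U statistic.
# x and y are hypothetical samples chosen to contain no ties.
x <- c(1.1, 2.4, 3.8, 0.7, 2.9)
y <- c(3.1, 4.6, 2.2, 5.0, 4.1, 3.5)
n1 <- length(x)
n2 <- length(y)

r  <- rank(c(x, y))            # steps 1 and 2: rank the combined group
S1 <- sum(r[seq_len(n1)])      # step 3: rank sum of the first sample
U  <- S1 - n1 * (n1 + 1) / 2   # test statistic

mean_U <- n1 * n2 / 2
var_U  <- n1 * n2 * (n1 + n2 + 1) / 12   # no-ties variance
z <- (U - mean_U) / sqrt(var_U)

# R reports the same quantity as W.
W <- unname(wilcox.test(x, y, correct = FALSE)$statistic)
print(c(U = U, W = W, z = z))
```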
To use the Wilcoxon-Mann-Whitney test (or the Wilcoxon rank sum test or the Mann-Whitney test) in R, you can use the wilcox.test() function:

> var1 <- rnorm(100, mean=1, sd=1);
> var2 <- rnorm(100, mean=2, sd=2);
> var3 <- rnorm(100, mean=3, sd=3);
> data <- data.frame(var1, var2, var3);
> wilcox.test(data$var1, data$var2, correct = FALSE);

        Wilcoxon rank sum test

data:  data$var1 and data$var2
W = 2966, p-value = 6.7e-07
alternative hypothesis: true location shift is not equal to 0

With data drawn from random.org instead:

> var1 <- randomNumbers(n=100, min=1, max=1000, col=1);
> var2 <- randomNumbers(n=100, min=1, max=1000, col=1);
> var3 <- randomNumbers(n=100, min=1, max=1000, col=1);
> data <- data.frame(var1[,1], var2[,1], var3[,1]);
> wilcox.test(data[,1], data[,2], correct=FALSE);

        Wilcoxon rank sum test

data:  data[, 1] and data[, 2]
W = 5697.5, p-value = 0.08833
alternative hypothesis: true location shift is not equal to 0

$$H_0: F(x) = G(y)$$

$$H_a: F(x) \neq G(y)$$

Because random.org returns different numbers on every run, the p-value varies. The run discussed here gave a p-value of 0.3351, and the run shown above gives 0.08833; both are more than 0.05. Hence, you fail to reject the null hypothesis at the 5% significance level (95% confidence level). There is no significant difference between the median of the first variable and the median of the second variable.
Kruskal-Wallis Test


The Kruskal-Wallis test is a nonparametric test that is an extension of the Mann-Whitney U test for three or more samples. The test requires samples to be identically distributed. Kruskal-Wallis is an alternative to one-way ANOVA. The Kruskal-Wallis test tests the differences between scores of k independent samples of unequal sizes, with the ith sample containing $l_i$ rows. The hypothesis is

$$H_0: \mu_1 = \mu_2 = \dots = \mu_k$$

$$H_a: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j$$

where $\mu$ is the median. The alternate hypothesis is that at least one median is different.

The algorithm is as follows:

1. Rank all rows in ascending order. Tied scores will have average ranks.

2. Sum up the ranks of rows in each sample to give rank sum $R_i$ for $i = 1, 2, \dots, k$.

3. The Kruskal-Wallis test statistic is calculated as

$$H = \frac{12}{N(N + 1)} \sum_{i=1}^{k} \frac{R_i^2}{l_i} - 3(N + 1)$$

where N is the total number of rows. If there are tied scores, H is divided by

$$1 - \frac{\sum (t^3 - t)}{N^3 - N}$$

where t is the number of tied scores in a group.

To use the Kruskal-Wallis test in R:

> data("airquality");
> str(airquality);
'data.frame':   153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
> summary(airquality);
     Ozone           Solar.R           Wind
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400
 Median : 31.50   Median :205.0   Median : 9.700
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500
 Max.   :168.00   Max.   :334.0   Max.   :20.700
 NA's   :37       NA's   :7
      Temp           Month            Day
 Min.   :56.00   Min.   :5.000   Min.   : 1.0
 1st Qu.:72.00   1st Qu.:6.000   1st Qu.: 8.0
 Median :79.00   Median :7.000   Median :16.0
 Mean   :77.88   Mean   :6.993   Mean   :15.8
 3rd Qu.:85.00   3rd Qu.:8.000   3rd Qu.:23.0
 Max.   :97.00   Max.   :9.000   Max.   :31.0

> kruskal.test(airquality$Ozone ~ airquality$Month);


        Kruskal-Wallis rank sum test

data:  airquality$Ozone by airquality$Month
Kruskal-Wallis chi-squared = 29.267, df = 4, p-value = 6.901e-06

$$H_0: \mu_1 = \mu_2 = \dots = \mu_k$$

$$H_a: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j$$

The p-value is 6.901e-06, which is less than 0.05. Hence, you reject the null hypothesis at the 5% significance level (95% confidence level). There are significant differences in the median Ozone between at least two of the months.
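The Kruskal-Wallis chi-squared value can be reproduced from the rank sums. The following is a minimal sketch on the same airquality data; rows with a missing Ozone value are dropped first, which is also what kruskal.test() does, and the variable names are illustrative:

```r
# Hand computation of the Kruskal-Wallis statistic for Ozone by Month.
data(airquality)
aq <- airquality[!is.na(airquality$Ozone), ]   # drop rows with missing Ozone

N <- nrow(aq)                          # total number of rows
r <- rank(aq$Ozone)                    # step 1: ranks, ties get average ranks
R <- tapply(r, aq$Month, sum)          # step 2: rank sum per sample
l <- tapply(r, aq$Month, length)       # rows per sample

# Step 3: test statistic, then the correction for tied scores.
H <- 12 / (N * (N + 1)) * sum(R^2 / l) - 3 * (N + 1)
ties <- table(aq$Ozone)
H <- H / (1 - sum(ties^3 - ties) / (N^3 - N))

ref <- unname(kruskal.test(Ozone ~ Month, data = aq)$statistic)
print(c(by_hand = H, kruskal_test = ref))
```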


Linear Regressions 


Regression analysis is a form of predictive modeling technique that identifies the relationships between dependent and independent variable(s). The technique is used to find causal effect relationships between variables.

The benefit of using regression analysis is that it identifies the significant relationships between dependent and independent variables and the strength of the impact of multiple independent variables on a dependent variable.

Linear regression finds the relationship between one dependent variable and one independent variable using a regression line.

The linear regression equation is $y = b_0 + b_1 x$

y is the dependent variable, x is the independent variable, $b_0$ is the intercept, and $b_1$ is the slope. See Figure 6-8.
[Figure: Weight (vertical axis) plotted against Height (horizontal axis)]

Figure 6-8. Linear Regressions


To calculate the slope, you can use

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

To calculate the intercept, you can use

$$b_0 = \bar{y} - b_1 \bar{x}$$

If $b_1 > 0$, x and y have a positive relationship.
If $b_1 < 0$, x and y have a negative relationship.
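The slope and intercept formulas can be checked against lm(). The following is a minimal sketch on simulated data; the seed and the distribution parameters are illustrative choices:

```r
# Slope and intercept by hand, compared with lm().
set.seed(123)
x <- rnorm(100, mean = 1, sd = 1)
y <- rnorm(100, mean = 2, sd = 2)

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

fit <- coef(lm(y ~ x))
print(c(b0 = b0, b1 = b1))
print(fit)
```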
To use linear regression in R, you use the lm() function:

> x <- rnorm(100, mean=1, sd=1);
> y <- rnorm(100, mean=2, sd=2);
> mod <- lm(y ~ x);
> mod;

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
     1.9774      -0.1816

> summary(mod);

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-4.8312 -1.0595 -0.0619  1.2180  6.3808

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.9774     0.3076   6.429 4.69e-09 ***
x            -0.1816     0.2259  -0.804    0.423
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.965 on 98 degrees of freedom
Multiple R-squared:  0.006554,  Adjusted R-squared:  -0.003583
F-statistic: 0.6466 on 1 and 98 DF,  p-value: 0.4233

Because no seed is set, your numbers will differ. Setting a seed makes the run reproducible:

> set.seed(123);
> x <- rnorm(100, mean=1, sd=1);
> y <- rnorm(100, mean=2, sd=2);
> data <- data.frame(x, y);
> mod <- lm(data$y ~ data$x, data=data);
> mod;

Call:
lm(formula = data$y ~ data$x, data = data)

Coefficients:
(Intercept)       data$x
     1.8993      -0.1049

> summary(mod);

Call:
lm(formula = data$y ~ data$x, data = data)

Residuals:
   Min     1Q Median     3Q    Max
-3.815 -1.367 -0.175  1.161  6.581

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.8993     0.3033   6.261 1.01e-08 ***
data$x       -0.1049     0.2138  -0.491    0.625
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.941 on 98 degrees of freedom
Multiple R-squared:  0.002453,  Adjusted R-squared:  -0.007726
F-statistic: 0.241 on 1 and 98 DF,  p-value: 0.6246


The output shows that the linear equation is

y = -0.1049x + 1.8993

The p-values of 1.01e-08 (intercept), 0.625 (x), and 0.6246 (overall F-test) tell you the significance of the coefficients and of the linear model. When a p-value is less than 0.05, the corresponding coefficient or model is significant.

• $H_0$: The coefficient associated with the variable is equal to zero
• $H_a$: The coefficient is not equal to zero (there is a relationship)

The intercept has a p-value of 1.01e-08, which is smaller than 0.05, so the intercept is significantly different from zero. The significance is indicated by the number of * characters. The x has a p-value of 0.625, which is more than 0.05, so x has no significant relationship with the y variable. You fail to reject the null hypothesis for x at the 5% significance level (95% confidence level).

R-squared is the proportion of the variation in the dependent variable that the model explains, and the formula is

$$R^2 = 1 - \frac{SSE}{SST}$$

where SSE is the sum of squared errors

$$SSE = \sum_i (y_i - \hat{y}_i)^2$$

and SST is the sum of the squared total

$$SST = \sum_i (y_i - \bar{y})^2$$

$\bar{y}$ is the mean of y and $\hat{y}_i$ is the fitted value for row i.

$\hat{y}$ is the fitted value, which means that in y = -0.1049x + 1.8993, you fit in x to get $\hat{y}$. $(y_i - \hat{y}_i)$ means that you use the original y values minus the predicted values, which is the error. Hence, $\sum_i (y_i - \hat{y}_i)^2$ is the sum of the squared errors (SSE). In order for $\frac{SSE}{SST}$ to be small, SSE must be small. Consequently, $1 - \frac{SSE}{SST}$ is large when $\frac{SSE}{SST}$ is small.

Hence, the higher the R-squared and the adjusted R-squared, the better the linear model. The lower the standard error, the better the model.
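SSE, SST, and R-squared can be computed directly from the fitted values and compared with the value that summary() reports. The following is a minimal sketch on simulated data; the seed and variable names are illustrative:

```r
# R-squared by hand, compared with summary(lm).
set.seed(123)
x <- rnorm(100, mean = 1, sd = 1)
y <- rnorm(100, mean = 2, sd = 2)
mod <- lm(y ~ x)

y_hat <- fitted(mod)                 # fitted values
SSE <- sum((y - y_hat)^2)            # sum of squared errors
SST <- sum((y - mean(y))^2)          # total sum of squares
r_squared <- 1 - SSE / SST

print(r_squared)
print(summary(mod)$r.squared)
```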
Multiple Linear Regressions


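A simple linear regression is for a single response variable, y, and a single independent variable, x. The equation for a simple linear regression is

$$y = b_0 + b_1 x$$

Multiple linear regression is built from a simple linear regression. Multiple linear regression is used when you have more than one independent variable. The equation of a multiple linear regression is

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + \varepsilon$$

When you have n observations or rows in the data set, you have the following model:

$$y_1 = b_0 + b_1 x_{11} + b_2 x_{12} + \dots + b_k x_{1k} + \varepsilon_1$$
$$y_2 = b_0 + b_1 x_{21} + b_2 x_{22} + \dots + b_k x_{2k} + \varepsilon_2$$
$$\vdots$$
$$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_k x_{ik} + \varepsilon_i$$
$$\vdots$$
$$y_n = b_0 + b_1 x_{n1} + b_2 x_{n2} + \dots + b_k x_{nk} + \varepsilon_n$$

Using a matrix, you can represent the equations as

$$y = Xb + \varepsilon$$

where

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}, \quad
b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix}, \quad \text{and} \quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$

To calculate the coefficients:

$$b = (X'X)^{-1} X'y$$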
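The matrix formula can be evaluated directly and compared with lm(). The following is a minimal sketch on simulated data; the seed and variable names are illustrative, and the explicit inverse is used only to mirror the formula (lm() itself uses a more stable decomposition):

```r
# Coefficients from the matrix formula, compared with lm().
set.seed(123)
x  <- rnorm(100, mean = 1, sd = 1)
x2 <- rnorm(100, mean = 2, sd = 5)
y  <- rnorm(100, mean = 2, sd = 2)

X <- cbind(1, x, x2)                       # design matrix with a column of 1s
b <- solve(t(X) %*% X) %*% t(X) %*% y      # b = (X'X)^-1 X'y

fit <- coef(lm(y ~ x + x2))
print(as.vector(b))
print(fit)
```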
You can use the multiple linear regression in R:

> x <- rnorm(100, mean=1, sd=1);
> x2 <- rnorm(100, mean=2, sd=5);
> y <- rnorm(100, mean=2, sd=2);
> mod <- lm(y ~ x + x2);
> mod;

Call:
lm(formula = y ~ x + x2)

Coefficients:
(Intercept)            x           x2
    1.59334      0.06159      0.03280

> summary(mod);

Call:
lm(formula = y ~ x + x2)

Residuals:
    Min      1Q  Median      3Q     Max
-4.0569 -1.7760  0.1867  1.2021  5.1501

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.59334    0.31253   5.098  1.7e-06 ***
x            0.06159    0.18813   0.327    0.744
x2           0.03280    0.04179   0.785    0.434
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.121 on 97 degrees of freedom
Multiple R-squared:  0.007867,  Adjusted R-squared:  -0.01259
F-statistic: 0.3846 on 2 and 97 DF,  p-value: 0.6818


To create a multiple linear regression in R, you must first create data:

> set.seed(123);
> x <- rnorm(100, mean=1, sd=1);
> x2 <- rnorm(100, mean=2, sd=5);
> y <- rnorm(100, mean=2, sd=2);
> data <- data.frame(x, x2, y);

You create a multiple linear regression model using the lm() function:

> mod <- lm(data$y ~ data$x + data$x2, data=data);
> mod;

Call:
lm(formula = data$y ~ data$x + data$x2, data = data)

Coefficients:
(Intercept)       data$x      data$x2
   2.517425    -0.266343     0.009525

data$y ~ data$x + data$x2 is y ~ x + x2

To get the summary of the model, you can use the summary() function:

> summary(mod);

Call:
lm(formula = data$y ~ data$x + data$x2, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-3.7460 -1.3215 -0.2489  1.2427  4.1597

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.517425   0.305233   8.248 7.97e-13 ***
data$x      -0.266343   0.209739  -1.270    0.207
data$x2      0.009525   0.039598   0.241    0.810
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.903 on 97 degrees of freedom
Multiple R-squared:  0.01727,   Adjusted R-squared:  -0.00299
F-statistic: 0.8524 on 2 and 97 DF,  p-value: 0.4295
The linear model from the output is

y = -0.266343x + 0.009525x2 + 2.517425

The p-values are 7.97e-13 (intercept), 0.207 (x), 0.810 (x2), and 0.4295 (overall F-test). The intercept is significant because its p-value is 7.97e-13, which is smaller than 0.05. Neither x nor x2 is significant, because both p-values are more than 0.05.

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

$\bar{y}$ is the mean of y and $\hat{y}_i$ is the fitted value for row i.

$\hat{y}$ is the fitted value, which means that in y = -0.266343x + 0.009525x2 + 2.517425, you fit in x and x2 to get $\hat{y}$. $(y_i - \hat{y}_i)$ means that you use the original y values minus the predicted values, which is the error. Hence, $\sum_i (y_i - \hat{y}_i)^2$ is the SSE. In order for $\frac{SSE}{SST}$ to be small, SSE must be small. Consequently, $1 - \frac{SSE}{SST}$ is large when $\frac{SSE}{SST}$ is small. The R-squared is 0.01727 and the adjusted R-squared is -0.00299. The higher the R-squared value, the better, as SSE is smaller.
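The adjusted R-squared in the output can be recovered from the R-squared, the number of rows n, and the number of independent variables k. The following is a minimal sketch using the standard adjustment formula, which the text above does not state; the inputs are the rounded values printed in the output:

```r
# Adjusted R-squared from the printed R-squared (standard formula,
# not stated in the text): 1 - (1 - R^2) * (n - 1) / (n - k - 1).
r2 <- 0.01727   # Multiple R-squared from the output
n  <- 100       # number of rows
k  <- 2         # independent variables: x and x2

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 5))
```

The adjustment penalizes every extra independent variable, which is why the adjusted value can drop below zero when the variables explain almost nothing.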


Conclusion 


In this chapter, you looked into using R programming for inferential statistics and regressions. You now know that inferential statistics and descriptive statistics are the main branches of statistics. Descriptive statistics derives a summary from the data set and uses central tendency, dispersion, and skewness. Inferential statistics describes and makes inferences about the population from the sampled data. In inferential statistics, you use hypothesis testing and the estimation of parameters.

You learned that the apply() function can perform a loop to go through the data and apply a function. The function can be the mean() function from R or it can be a customized function.

You also found out that sampling is the selection of a subset of a population. The population is the data from everyone. Sometimes a sample can be a subset from a full data set. The advantages of sampling are that the cost is lower and the data collection is more efficient than collecting the data from everyone in the population.

You also learned that correlation is a statistical association that measures how closely two variables are related and describes the linear relationship between them.

You also learned that covariance is a measure of how two variables vary together. When greater values of one variable go with greater values of the other variable, the covariance is positive. When greater values of one variable go with lesser values of the other variable, the covariance is negative.

You also learned how p-values help you determine the significance of your statistical test results. Your claim in the test is known as a null hypothesis, and the alternate hypothesis means that you believe the null hypothesis is untrue.

You also learned that a t-test is one of the more important tests in statistics. A t-test is used to determine whether the means of two samples are equal to each other. The null hypothesis means that the two means are equal, and the alternative means that the two means are different.

You also learned that the chi-square test is used to test the relationship between two categorical variables. The null hypothesis means that there is no relationship between the categorical variables.

You also learned that ANOVA is the process of testing the means of two or more groups. ANOVA also checks the impact of factors by comparing the means of different samples. In a t-test, you test the means of two samples; in a chi-square test, you test categorical attributes or variables; and in ANOVA, you test more samples.

You also learned that nonparametric tests are tests that do not require the variable and sample to be normally distributed. Most of the time you should use parametric tests like t-tests, chi-square tests, and ANOVA because they are more powerful when their assumptions hold. You use nonparametric tests when you do not have normally distributed data and the sample data is big.

You also learned that regression analysis is a form of predictive modeling technique that identifies the relationships between dependent and independent variable(s). The technique is used to find causal effect relationships between variables.


References 


(n.d.). Manuscript submitted for publication, Columbia University. 
Retrieved September 6, 2018, from www. stat.columbia.edu/^martin/ 
W2024/R8.pdf. 

17.5.1.2 Algorithms (One-sample Wilcoxon signed rank test). (n.d.). 
Retrieved from www. originlab.com/doc/Origin-Help/SignRank1- 
Algorithm. 

17.5.4.2 Algorithms (Mann-Whitney Test). (n.d.). Retrieved from www. 
originlab.com/doc/Origin-Help/MW-Test-Algorithm. 

17.5.6.2 Algorithms (Kruskal-Wallis ANOVA). (n.d.). Retrieved from 
www. originlab.com/doc/Origin-Help/KW-ANOVA-Algorithm. 

Analysis of variance. (2018, September 03). Retrieved from https: // 
en.wikipedia.org/wiki/Analysis of variance. 

ANOVA Test: Definition, Types, Examples. (n.d.). Retrieved from www. 
statisticshowto.com/probability-and-statistics/hypothesis- 
testing/anova/. 

Apply(), sapply(), tapply() in R with Examples. (n.d.). Retrieved from 
www. guru99.com/r-apply-sapply-tapply.html. 

Chi-Square Statistic: How to Calculate It/Distribution. (n.d.). Retrieved 
from www. statisticshowto.com/probability-and-statistics/chi- 
square/. 

Chi-Square Test of Independence in R. (n.d.). Retrieved from www. 
sthda.com/english/wiki/chi-square-test-of-independence-in-r. 


231 


CHAPTER 6 INFERENTIAL STATISTICS AND REGRESSIONS 


Correlation. (n.d.). Retrieved from www.mathsisfun.com/data/ 
correlation.html. 

Covariance. (n.d.). Retrieved from http: //mathworld.wolfram.com/ 
Covariance.html. 

Das, S. (2018, June 06). Data Sampling Methods in R - DZone 
AI. Retrieved from https : //dzone.com/articles/data-sampling- 
methods -in-r. 

Department of Statistics. (n.d.). Retrieved from https: //statistics. 
berkeley.edu/computing/r-t-tests. 

P. (n.d.). Eval(ez write tag([|728,90],'r statistics co-box- 
3’,ezslot_4’|));Linear Regression. Retrieved from http: //r-statistics. 
co/Linear-Regression.html. 

Evaluation of Means for small samples - The t-test. (n.d.). Retrieved 
from www. chem. utoronto.ca/coursenotes/analsci/stats/ttest.html. 

F Statistic/F Value: Simple Definition and Interpretation. (n.d.). 
Retrieved from www. statisticshowto.com/probability-and- 
statistics/f-statistic-value-test/. 

Galili, T. (n.d.). Tag: Iris data set. Retrieved from www. r-statistics. 
com/tag/iris-data-set/. 

Ghosh, B. (2017, August 28). One-way ANOVA in R. Retrieved from 
https://datascienceplus.com/one-way-anova-in-r/. 

How to Take Samples from Data in R. (n.d.). Retrieved from www. 
dummies.com/programming/r/how-to-take-samples-from-data-in-r/. 

Introduction. (n.d.). Retrieved from http: //sphweb.bumc.bu.edu/ 
otlt/MPH-Modules/BS/BS704 Nonparametric/BS704 Nonparametric 
print.html. 

Kabacoff, R. (n.d.). Correlations. Retrieved from www. statmethods. 
net/stats/correlations.html. 

Kabacoff, R. (n.d.). ANOVA. Retrieved from www. statmethods .net/ 
stats/anova.html. 


232 


CHAPTER 6 INFERENTIAL STATISTICS AND REGRESSIONS 


Kabacoff, R. (n.d.). Nonparametric Tests of Group Differences. 
Retrieved from www. statmethods .net/stats/nonparametric.html. 

Kabacoff, R. (n.d.). Multiple (Linear) Regression. Retrieved from www. 
statmethods .net/stats/regression.html. 

Kruskal-Wallis Test. (n.d.). Retrieved from www. r-tutor.com/ 
elementary-statistics/non-parametric-methods/kruskal-wallis- 
Leet. 

Kruskal-Wallis Test in R. (n.d.). Retrieved from www. sthda.com/ 
english/wiki/kruskal-wallis-test-in-r. 

Linear Regression Analysis using SPSS Statistics. (n.d.). Retrieved from 
https://statistics.laerd.com/spss-tutorials/linear-regression- 
using-spss-statistics.php. 

Mann-Whitney-Wilcoxon Test. (n.d.). Retrieved from www. r-tutor. 
com/elementary-statistics/non-parametric-methods/mann-whitney- 
wilcoxon-test. 

MANOVA Test in R: Multivariate Analysis of Variance. (n.d.). Retrieved 
from www. sthda.com/english/wiki/manova-test-in-r-multivariate- 
analysis-of-variance. 

Multiple Linear Regression Analysis. (n.d.). Retrieved from http: // 
reliawiki.org/index.php/Multiple Linear Regression Analysis. 

Non Parametric Data and Tests (Distribution Free Tests). (n.d.). 
Retrieved from www. statisticshowto.com/parametric-and-non- 
parametric-data/. 

One-Sample Wilcoxon Signed Rank Test in R. (n.d.). Retrieved from 
www. sthda.com/english/wiki/one-sample-wilcoxon-signed-rank- 
test-in-r. 

One-Way ANOVA Test in R. (n.d.). Retrieved from www. sthda. com/ 
english/wiki/one-way-anova-test-in-r. 

P-value in Statistical Hypothesis Tests: What is it? (n.d.). Retrieved from
www.statisticshowto.com/p-value/.




Paired t Test. (n.d.). Retrieved from
www.statsdirect.com/help/parametric_methods/paired_t.htm.

R ANOVA Tutorial: One-way & Two-way [with Examples]. (n.d.).
Retrieved from www.guru99.com/r-anova-tutorial.html.

R Tutorial Series: Multiple Linear Regression. (2016, October 02).
Retrieved from www.r-bloggers.com/r-tutorial-series-multiple-linear-regression/.

R: Wilcoxon Rank Sum and Signed Rank Tests. (n.d.). Retrieved from
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/wilcox.test.html.

Ray, S., & Business Analytics and Intelligence. (2018, April 06).
7 Types of Regression Techniques you should know. Retrieved from
www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/.

Regression Analysis: Step by Step Articles, Videos, Simple Definitions.
(n.d.). Retrieved from
www.statisticshowto.com/probability-and-statistics/regression-analysis/.

Sample rows of subgroups from dataframe with dplyr. (n.d.). Retrieved from
https://stackoverflow.com/questions/21255366/sample-rows-of-subgroups-from-dataframe-with-dplyr.

Sampling in Statistics: Different Sampling Methods, Types & Error.
(n.d.). Retrieved from
www.statisticshowto.com/probability-and-statistics/sampling-in-statistics/.

SIGNED RANK TEST. (n.d.). Retrieved from
www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/signrank.htm.

Simple Random Sampling and Other Sampling Methods. (n.d.).
Retrieved September 6, 2018, from
https://onlinecourses.science.psu.edu/stat100/node/18/.

Singh, G., H., & Budding Data Scientist. (2018, January 15). A Simple
Introduction to ANOVA (with applications in Excel). Retrieved from
www.analyticsvidhya.com/blog/2018/01/anova-analysis-of-variance/.




Swaminathan, S. (2018, February 26). Linear Regression -
Detailed View - Towards Data Science. Retrieved from
https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86.

T Test. (n.d.). Retrieved from
https://researchbasics.education.uconn.edu/t-test/#.

T test formula. (n.d.). Retrieved from
www.sthda.com/english/wiki/t-test-formula.

The chi-square test. (n.d.). Retrieved September 6, 2018, from
https://web.stanford.edu/class/psych252/cheatsheets/chisquare.html.

Two sample Student's t-test #1. (2010, September 06). Retrieved from
www.r-bloggers.com/two-sample-students-t-test-1/.

Two-way Anova. (n.d.). Retrieved from
https://rcompanion.org/rcompanion/d_08.html.

Two-way ANOVA. (n.d.). Retrieved September 6, 2018, from
https://onlinecourses.science.psu.edu/stat500/node/216/.

Unpaired Two-Samples T-test in R. (n.d.). Retrieved from
www.sthda.com/english/wiki/unpaired-two-samples-t-test-in-r.

Using apply, sapply, lapply in R. (2012, December 22). Retrieved from
www.r-bloggers.com/using-apply-sapply-lapply-in-r/.

Using Chi-Square Statistic in Research. (n.d.). Retrieved from
www.statisticssolutions.com/using-chi-square-statistic-in-research/.

Using R for statistical analyses - ANOVA. (n.d.). Retrieved from
www.gardenersown.co.uk/Education/Lectures/R/anova.htm.

Welch t-test. (n.d.). Retrieved from
www.sthda.com/english/wiki/welch-t-test.

Wetherill, C. (2015, August 17). How to Perform T-tests in R. Retrieved
from https://datascienceplus.com/t-tests/.




What a p-value Tells You about Statistical Data. (n.d.). Retrieved from
www.dummies.com/education/math/statistics/what-a-p-value-tells-you-about-statistical-data/.

Wilcoxon-Mann-Whitney rank sum test (or test U). (2009, August 05).
Retrieved from www.r-bloggers.com/wilcoxon-mann-whitney-rank-sum-test-or-test-u/.




Index 


A 


aes() function, 152 
ANOVA 
between-group 
variability, 199-200 
grand mean, 198 
hypothesis, 198 
one-way, 202-203 
two-way, 204, 206 
within-group variability, 201-202 
Apache Spark, 14-15, 18 
apply() function, 173, 175-176 


B 


Bar chart, 130-134 
barplot() function, 130 
Big data 
Apache Spark, 14 
challenges, 13, 17 
formats and types, 14 
Hadoop, 14 
IoT devices, 14 
properties, 17 
relational databases and 
desktop statistics, 14 
velocity, 14 
volume, 13 


© Eric Goh Ming Hui 2019 


Binomial distribution, 121-124 
Boolean operators, 68 

Boxplot, 143-144 

Break keyword, 72-75 
Business understanding, 8 


C 


Calculator R script 
add(), subtract(), product(), and 
division() functions, 81 
readline() function, 81 
running in RStudio IDE, 82-83 
Categorical data, 104 
Central limit theorem, 87 
Central tendency, 87, 105, 124-125 
Chi-square test, 197 
contingency test, 196-198 
goodness of fit test, 194-196 
Code editor, 42-45 
Comma-separated values (CSV) 
file, 88 
reading, 89-90 
writing, 91 
Common charts 
histogram, 160 
line chart, 162-163 
scatterplot, 161-162 
Computing Machinery and 
coord_flip() function, 159
Correlations, 183-184 
Covariance, 185-186 
Cross-industry standard 
process of data mining 
(CRISP-DM), 7-8 
Cumulative distribution function 
Data acquisition, 10

Data frame, 63-67 

Data mining, 1-2, 15-17 
business understanding, 8 
CRISP-DM, 7-8 
data preparation, 8 
data understanding, 8 
definition, 6 
deployment, 9 
evaluation, 9 
modeling, 9 
Nayes theorem, 7 
statistical learning and machine 

learning algorithms, 7 
Data processing
data selection, 97-99 
removing
duplicates, 103 
missing values, 102 
sorting, 99-101 
Data science, 16 
data product, 4 
diagram, 6 
domain expertise, 5 
history of, 5 
linear regression, 5 
product design and engineering 
knowledge, 6 
Statistics, 5 
Data types, 48-50 
Data understanding, 8, 17 
Data visualization, 129 
Descriptive analytics, 12, 17 
Descriptive statistics, 2-3, 173 
central limit theorem, 88 
central tendency, 87-88 
data and variables, 88 
dplyr library, 181-182 
element_text() function, 157
Excel file 

reading, 92-93 

writing, 93 
Facebook, 15, 18

Fit test, 194-196 

For loop, 69-70 
Functions, 75-77, 79 
GATE, 15
geom_point() function, 153
getwd() function, 89 
ggplot2 
common charts (see Common 
charts) 
geometric objects, 152-155 
grammar of 
graphics, 150-151 
labels, 155-156 
setup, 151-152 
themes, 156-157 
ggplotly() function, 168 
ggsave() function, 165 
GNU package, 1 
Google, 15, 18 
Hadoop, 14

High-level programming 
language (HLL), 2-3, 16 

hist() function, 135 

Histogram, 135-136 
Integrated development
environment (IDE), 2, 19
Dartmouth BASIC, 21
features, 21 
NetBeans, 21 
RStudio and R (see RStudio IDE) 
Softbench, 21 


Interquartile range, 111-112 
JSON file, 96-97
Kruskal-Wallis test, 216, 218


L 


labs() function, 155 

lapply() function, 173, 177 

library() function, 141, 148, 151, 211 
Linear regression, 5 

Line chart, 137-138 

lines() function, 136, 138 
Lists
data structure type, 54
length() function, 54 
syntax, create, 53 
Lists (cont.)
value/element 
delete, 57 
modification, 56 
values retrieve 
integer vector, 54 
logical vector, 55 
negative integer, 55 
lm() function, 220
Logical statements, 67-69 
Loops 
break and next keyword, 72-74 
for loop, 69-70 
repeat loop, 74-75 
while loop, 71-72 
Lower-level programming 
language (LLL), 2, 16 
Matrix
attributes() function, 59 
cbind() function, 62 
class() function, 59 
functions, 59
rbind() function, 62
t() function, 63
Mean, 109
N, O
Natural language processing 
(NLP), 11-12, 17 
Nayes theorem, 7 
NetBeans, 21 
Next keyword, 72-74 
Nonparametric test 
Kruskal-Wallis, 216-218 
Wilcoxon-Mann-Whitney, 
213-215 
Wilcoxon Signed 
Rank, 209-210, 212 
Normal distribution 
bell curve, 115 
bins, 116 
hist() function, 116 
inverse CDF, 118 
modality, 119 
p-th quantile, 118 
qqnorm() and qqline() 
functions, 116 
rnorm() function, 117 
Shapiro Test, 117 
skewness, 119-120 
standard deviation, 118 
Numeric data, 104 
pairs() function, 146
Pie chart, 139-141 
pie3D() function, 141 
plot() function, 137, 142 
Plotly JS, 166-169 


Prediction model, 9 

Predictive analytics, 12-13, 17 

Predictive modelling 
techniques, 218 

Prescriptive analytics, 12-13, 17 

Programming languages, 15, 17 
RapidMiner, 15, 17
R console, 39-42 
Reading data files 
CSV file 
class() function, 90 
read.csv() function, 89 
write.csv() function, 91 
Excel file 
data frame data type, 93 
read.xlsx() function, 92 
require() function, 92 
View() function, 92 
write.xlsx() function, 93 
JSON, 96-97 
SPSS file 
help() function, 95 
install.packages() 
function, 94 
read.spss() function, 95 
write.foreign() function, 96 
Regressions, 2, 4 
linear, 218-222
multiple linear, 223 
Repeat loop, 74-75
GNU package, 20
environment (IDE))

RGui interface, 20 

statistical and data visualization 
RStudio IDE
Choose R Installation
dialog, 28-29 

code editor, 33 

console results, 45 

downloading, Linux and 
Mac OS, 23 

Environment tab, 45 

Hello World application, 25 

installation, 23-24, 26 

intelligent code completion, 
21-22, 33, 37 

interface, 22, 27, 32-33 

latest version, downloading, 26 

loaded data, 35-36 

options, 28 

plot() function, 32 

R console, 22 

read.csv() function, 30-31 

results, 35 

RGui interface, 24 

R project website, 22-23 

running script, 34-35 

summary() function, 31 
RStudio IDE (cont.)
Tools menu, 27 
version changing, 29-30 
website, 25-26 


S 
Sampling 
cluster, 179-183 
SRS, 178 
stratified, 179 
sapply() function, 173, 177 
SAS Enterprise Miner, 15, 17 


mean, 109 

median, 109 

mode, 105-108 

normal distribution (see Normal 
distribution) 

numeric data, 104 
range, 110-111

sample, 104 

standard deviation, 114-115 

variable, 104 

variance, 112-114 


str() function, 123 
summary() function, 123, 203, 228 
Syntax of R programming 


SAS programming, 15, 18 
Scripts, 16
setwd() function, 89 
Simple random sampling (SRS), 178 
Skewness, 119-120 
Social network analysis graph, 
147-149 
Softbench IDE, 21 
SPSS file 
writing, 96
model, 10
evaluation/validation, 11
modeling, 11 
text Preprocessing, 10 
theme() function, 156 
TIOBE, 1, 18 
types, 187
Variables, 47-48
Volume, 13


W, X, Y, Z 